The course project is based on the Home Credit Default Risk (HCDR) Kaggle Competition.
The challenge is to construct a model that can predict the level of risk associated with an individual loan. With this project, we intend to use historical loan application data to predict whether or not a borrower will be able to repay a loan.
After implementing phases 1 and 2, we realized it would be best to add a few more models for comparison, since we were facing underfitting and overfitting with the Naive Bayes and Random Forest models. For the final phase, we have therefore implemented all of the following in this project:
In phase 1, we faced issues related to data size, unwanted data, and a lack of tuning. In phase 2, our main goal was to add feature engineering and hyperparameter tuning to the phase 1 pipeline. In phase 3, we implement the additional models mentioned above, along with neural networks.
The results for this project are as follows:
Dataset link: https://www.kaggle.com/c/home-credit-default-risk/data
Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.
The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would otherwise be unable to obtain loans or would fall victim to untrustworthy lenders.
Home Credit group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018).
While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.
There are 7 different sources of data: the main application data (application_train/application_test), bureau, bureau_balance, previous_application, POS_CASH_balance, installments_payments, and credit_card_balance.
Eliminating missing data: checking for any null/missing values in our data and examining the datatypes of the features present in the dataset, including values recorded as zeroes.
Data joining/merging: joining the features that have high correlation.
Best feature extraction: extracting the most important features for the model pipelines and hyperparameter tuning.
Implementation of additional models: implementing AdaBoost, Bagging, and XGBoost.
Implementation of neural networks: implementing neural networks to check whether they give better accuracy than the other models used.
Kaggle submission: getting Kaggle submission scores for each model to see which model fits best.
Importing all the necessary Python libraries:
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
Reading the CSV data files:
df_train = pd.read_csv('/Users/athulyaanand/Downloads/application_train.csv')
df_test = pd.read_csv('/Users/athulyaanand/Downloads/application_test.csv')
bureau = pd.read_csv('/Users/athulyaanand/Downloads/bureau.csv')
bureau_balance = pd.read_csv('/Users/athulyaanand/Downloads/bureau_balance.csv')
install_payment = pd.read_csv('/Users/athulyaanand/Downloads/installments_payments.csv')
pos_cash_balance = pd.read_csv('/Users/athulyaanand/Downloads/POS_CASH_balance.csv')
previous_application = pd.read_csv('/Users/athulyaanand/Downloads/previous_application.csv')
credit_card_balance = pd.read_csv('/Users/athulyaanand/Downloads/credit_card_balance.csv')
# Collapse each one-to-many auxiliary table to one row per applicant (median of each column)
install_payment = install_payment.groupby('SK_ID_CURR').median()
pos_cash_balance = pos_cash_balance.groupby('SK_ID_CURR').median()
previous_application = previous_application.groupby('SK_ID_CURR').median()
credit_card_balance = credit_card_balance.groupby('SK_ID_CURR').median()
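The aggregated auxiliary tables can then be merged back into the application data on SK_ID_CURR. A minimal sketch with toy stand-ins (the names `app` and `bureau_sample` and their values are hypothetical; only the column names follow the Kaggle schema):

```python
import pandas as pd

# Hypothetical miniature versions of the real tables loaded above
app = pd.DataFrame({'SK_ID_CURR': [1, 2, 3], 'AMT_CREDIT': [100.0, 200.0, 300.0]})
bureau_sample = pd.DataFrame({'SK_ID_CURR': [1, 1, 2], 'AMT_CREDIT_SUM': [50.0, 70.0, 90.0]})

# Collapse the one-to-many table to one row per applicant (median, as above),
# then left-join so applicants without bureau records are kept
agg = bureau_sample.groupby('SK_ID_CURR').median().add_prefix('BUREAU_').reset_index()
merged = app.merge(agg, on='SK_ID_CURR', how='left')
```

A left join is used here so that applicants with no rows in the auxiliary table stay in the training data (their aggregate columns are simply NaN).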
EDA, or exploratory data analysis, is an essential component of any Data Analysis or Data Science project. Essentially, EDA entails analyzing the dataset to identify patterns, anomalies (outliers), and hypotheses based on our understanding of the dataset.
EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing task, and it provides a better understanding of dataset variables and the relationships between them. It can also help determine whether the statistical techniques you are considering for data analysis are appropriate.
A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes, i.e., data aligned in rows and columns. It consists of three principal components: the data, the rows, and the columns.
head(): This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it. For negative values of n, this function returns all rows except the last n rows, equivalent to df[:-n].
describe(): Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
info(): The dataframe.info() function gives a concise summary of the dataframe, which comes in handy during exploratory analysis for a quick overview of the dataset.
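A toy illustration of the inspection methods described above (the frame `demo` is made up for the example), including the negative-n behavior of head():

```python
import pandas as pd

# A small frame to illustrate the inspection methods described above
demo = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [10.0, 20.0, 30.0, 40.0, 50.0]})

first_two = demo.head(2)      # first 2 rows
all_but_last = demo.head(-1)  # negative n: everything except the last row, like demo[:-1]
stats = demo.describe()       # count/mean/std/min/quartiles/max per numeric column
```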
print ("\nhead():\n")
df_train.head()
head():
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 122 columns
print ("\ndescribe():\n")
df_train.describe()
describe():
| | SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.000000 | 307511.000000 | 307511.000000 | 3.075110e+05 | 3.075110e+05 | 307499.000000 | 3.072330e+05 | 307511.000000 | 307511.000000 | 307511.000000 | ... | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 |
| mean | 278180.518577 | 0.080729 | 0.417052 | 1.687979e+05 | 5.990260e+05 | 27108.573909 | 5.383962e+05 | 0.020868 | -16036.995067 | 63815.045904 | ... | 0.008130 | 0.000595 | 0.000507 | 0.000335 | 0.006402 | 0.007000 | 0.034362 | 0.267395 | 0.265474 | 1.899974 |
| std | 102790.175348 | 0.272419 | 0.722121 | 2.371231e+05 | 4.024908e+05 | 14493.737315 | 3.694465e+05 | 0.013831 | 4363.988632 | 141275.766519 | ... | 0.089798 | 0.024387 | 0.022518 | 0.018299 | 0.083849 | 0.110757 | 0.204685 | 0.916002 | 0.794056 | 1.869295 |
| min | 100002.000000 | 0.000000 | 0.000000 | 2.565000e+04 | 4.500000e+04 | 1615.500000 | 4.050000e+04 | 0.000290 | -25229.000000 | -17912.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 189145.500000 | 0.000000 | 0.000000 | 1.125000e+05 | 2.700000e+05 | 16524.000000 | 2.385000e+05 | 0.010006 | -19682.000000 | -2760.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 278202.000000 | 0.000000 | 0.000000 | 1.471500e+05 | 5.135310e+05 | 24903.000000 | 4.500000e+05 | 0.018850 | -15750.000000 | -1213.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 367142.500000 | 0.000000 | 1.000000 | 2.025000e+05 | 8.086500e+05 | 34596.000000 | 6.795000e+05 | 0.028663 | -12413.000000 | -289.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| max | 456255.000000 | 1.000000 | 19.000000 | 1.170000e+08 | 4.050000e+06 | 258025.500000 | 4.050000e+06 | 0.072508 | -7489.000000 | 365243.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 9.000000 | 8.000000 | 27.000000 | 261.000000 | 25.000000 |
8 rows × 106 columns
print ("\ninfo():\n")
df_train.info()
info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
print ("\nhead():\n")
df_test.head()
print(df_test.columns)
head():
Index(['SK_ID_CURR', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR',
'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT',
'AMT_ANNUITY', 'AMT_GOODS_PRICE',
...
'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR',
'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
'AMT_REQ_CREDIT_BUREAU_YEAR'],
dtype='object', length=121)
print ("\ndescribe():\n")
df_test.describe()
describe():
| | SK_ID_CURR | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 48744.000000 | 48744.000000 | 4.874400e+04 | 4.874400e+04 | 48720.000000 | 4.874400e+04 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | ... | 48744.000000 | 48744.0 | 48744.0 | 48744.0 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 |
| mean | 277796.676350 | 0.397054 | 1.784318e+05 | 5.167404e+05 | 29426.240209 | 4.626188e+05 | 0.021226 | -16068.084605 | 67485.366322 | -4967.652716 | ... | 0.001559 | 0.0 | 0.0 | 0.0 | 0.002108 | 0.001803 | 0.002787 | 0.009299 | 0.546902 | 1.983769 |
| std | 103169.547296 | 0.709047 | 1.015226e+05 | 3.653970e+05 | 16016.368315 | 3.367102e+05 | 0.014428 | 4325.900393 | 144348.507136 | 3552.612035 | ... | 0.039456 | 0.0 | 0.0 | 0.0 | 0.046373 | 0.046132 | 0.054037 | 0.110924 | 0.693305 | 1.838873 |
| min | 100001.000000 | 0.000000 | 2.694150e+04 | 4.500000e+04 | 2295.000000 | 4.500000e+04 | 0.000253 | -25195.000000 | -17463.000000 | -23722.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 188557.750000 | 0.000000 | 1.125000e+05 | 2.606400e+05 | 17973.000000 | 2.250000e+05 | 0.010006 | -19637.000000 | -2910.000000 | -7459.250000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 277549.000000 | 0.000000 | 1.575000e+05 | 4.500000e+05 | 26199.000000 | 3.960000e+05 | 0.018850 | -15785.000000 | -1293.000000 | -4490.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| 75% | 367555.500000 | 1.000000 | 2.250000e+05 | 6.750000e+05 | 37390.500000 | 6.300000e+05 | 0.028663 | -12496.000000 | -296.000000 | -1901.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 3.000000 |
| max | 456250.000000 | 20.000000 | 4.410000e+06 | 2.245500e+06 | 180576.000000 | 2.245500e+06 | 0.072508 | -7338.000000 | 365243.000000 | 0.000000 | ... | 1.000000 | 0.0 | 0.0 | 0.0 | 2.000000 | 2.000000 | 2.000000 | 6.000000 | 7.000000 | 17.000000 |
8 rows × 105 columns
print ("\ninfo():\n")
df_test.info()
info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(40), object(16)
memory usage: 45.0+ MB
By creating new features from the existing ones (and then discarding the original features), Feature Extraction attempts to reduce the number of features in a dataset. The new reduced set of features will be able to summarize much of the information that was contained in the original set of features. Thus, an abridged version of the original features can be created by combining them.
In our analysis of the data, we found many missing values. Columns with more than 25% missing values were removed. We also checked each column's distribution of 0's and removed columns in which at least 85% of the rows were 0. In addition, we divided the data into numerical and categorical features. Missing numerical values were handled by an intermediate imputer pipeline that replaces them with the column mean, while missing categorical values were replaced with the column mode and the data was then encoded with OHE (One-Hot Encoding).
Firstly, let's find the percentage of the missing values in each column:
def missing_percentage(df):
    missing_values = df.isnull().sum(axis=0) * 100 / len(df)
    return missing_values.sort_values(ascending=False)
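Applied to a toy frame (the helper is repeated here so the snippet is self-contained, and the column values are made up), the function returns the per-column missing percentage in descending order:

```python
import pandas as pd

# Same helper as above
def missing_percentage(df):
    missing_values = df.isnull().sum(axis=0) * 100 / len(df)
    return missing_values.sort_values(ascending=False)

# Toy frame: 'a' is 75% missing, 'b' 25%, 'c' complete
toy = pd.DataFrame({'a': [1, None, None, None],
                    'b': [1, 2, None, 4],
                    'c': [1, 2, 3, 4]})
pct = missing_percentage(toy)  # sorted: a (75.0), b (25.0), c (0.0)
```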
Using 25% as the missing-value threshold, we extract all the columns whose missing percentage exceeds the threshold (these are the candidates for removal):
def missing(df, n):
    # Summarize each column (% missing, central tendency, cardinality),
    # then keep only the columns whose missing percentage exceeds n
    new_df = missing_percentage(df).reset_index()
    new_df.columns = ['index', 'flag']
    final_df = []
    for row in new_df.itertuples():
        try:
            # Numeric column: report median and mean
            final_df.append([row.index, row.flag, df[row.index].median(), df[row.index].mean(), df[row.index].nunique()])
        except (TypeError, ValueError):
            # Non-numeric column: report mode instead
            final_df.append([row.index, row.flag, df[row.index].mode(), 'NA', df[row.index].nunique()])
    columns = ['col_name', 'percentage_missing', 'median/Mode', 'mean', 'no_of_unique_values']
    temp = pd.DataFrame(final_df, columns=columns)
    return temp[temp['percentage_missing'] > n]
df_25 = missing(df_train,25)
df_analysis = missing(df_train,0)
plt.figure(figsize=[10,40])
plt.plot(df_analysis['percentage_missing'],df_analysis['col_name'] ,color = 'b')
plt.title('A line plot showing the percentage missing of each column')
df_25
| | col_name | percentage_missing | median/Mode | mean | no_of_unique_values |
|---|---|---|---|---|---|
| 0 | COMMONAREA_MEDI | 69.872297 | 0.0208 | 0.044595 | 3202 |
| 1 | COMMONAREA_AVG | 69.872297 | 0.0211 | 0.044621 | 3181 |
| 2 | COMMONAREA_MODE | 69.872297 | 0.019 | 0.042553 | 3128 |
| 3 | NONLIVINGAPARTMENTS_MODE | 69.432963 | 0.0 | 0.008076 | 167 |
| 4 | NONLIVINGAPARTMENTS_AVG | 69.432963 | 0.0 | 0.008809 | 386 |
| 5 | NONLIVINGAPARTMENTS_MEDI | 69.432963 | 0.0 | 0.008651 | 214 |
| 6 | FONDKAPREMONT_MODE | 68.386172 | 0 reg oper account dtype: object | NA | 4 |
| 7 | LIVINGAPARTMENTS_MODE | 68.354953 | 0.0771 | 0.105645 | 736 |
| 8 | LIVINGAPARTMENTS_AVG | 68.354953 | 0.0756 | 0.100775 | 1868 |
| 9 | LIVINGAPARTMENTS_MEDI | 68.354953 | 0.0761 | 0.101954 | 1097 |
| 10 | FLOORSMIN_AVG | 67.848630 | 0.2083 | 0.231894 | 305 |
| 11 | FLOORSMIN_MODE | 67.848630 | 0.2083 | 0.228058 | 25 |
| 12 | FLOORSMIN_MEDI | 67.848630 | 0.2083 | 0.231625 | 47 |
| 13 | YEARS_BUILD_MEDI | 66.497784 | 0.7585 | 0.755746 | 151 |
| 14 | YEARS_BUILD_MODE | 66.497784 | 0.7648 | 0.759637 | 154 |
| 15 | YEARS_BUILD_AVG | 66.497784 | 0.7552 | 0.752471 | 149 |
| 16 | OWN_CAR_AGE | 65.990810 | 9.0 | 12.061091 | 62 |
| 17 | LANDAREA_MEDI | 59.376738 | 0.0487 | 0.067169 | 3560 |
| 18 | LANDAREA_MODE | 59.376738 | 0.0458 | 0.064958 | 3563 |
| 19 | LANDAREA_AVG | 59.376738 | 0.0481 | 0.066333 | 3527 |
| 20 | BASEMENTAREA_MEDI | 58.515956 | 0.0758 | 0.087955 | 3772 |
| 21 | BASEMENTAREA_AVG | 58.515956 | 0.0763 | 0.088442 | 3780 |
| 22 | BASEMENTAREA_MODE | 58.515956 | 0.0746 | 0.087543 | 3841 |
| 23 | EXT_SOURCE_1 | 56.381073 | 0.505998 | 0.50213 | 114584 |
| 24 | NONLIVINGAREA_MODE | 55.179164 | 0.0011 | 0.027022 | 3327 |
| 25 | NONLIVINGAREA_AVG | 55.179164 | 0.0036 | 0.028358 | 3290 |
| 26 | NONLIVINGAREA_MEDI | 55.179164 | 0.0031 | 0.028236 | 3323 |
| 27 | ELEVATORS_MEDI | 53.295980 | 0.0 | 0.078078 | 46 |
| 28 | ELEVATORS_AVG | 53.295980 | 0.0 | 0.078942 | 257 |
| 29 | ELEVATORS_MODE | 53.295980 | 0.0 | 0.07449 | 26 |
| 30 | WALLSMATERIAL_MODE | 50.840783 | 0 Panel dtype: object | NA | 7 |
| 31 | APARTMENTS_MEDI | 50.749729 | 0.0864 | 0.11785 | 1148 |
| 32 | APARTMENTS_AVG | 50.749729 | 0.0876 | 0.11744 | 2339 |
| 33 | APARTMENTS_MODE | 50.749729 | 0.084 | 0.114231 | 760 |
| 34 | ENTRANCES_MEDI | 50.348768 | 0.1379 | 0.149213 | 46 |
| 35 | ENTRANCES_AVG | 50.348768 | 0.1379 | 0.149725 | 285 |
| 36 | ENTRANCES_MODE | 50.348768 | 0.1379 | 0.145193 | 30 |
| 37 | LIVINGAREA_AVG | 50.193326 | 0.0745 | 0.107399 | 5199 |
| 38 | LIVINGAREA_MODE | 50.193326 | 0.0731 | 0.105975 | 5301 |
| 39 | LIVINGAREA_MEDI | 50.193326 | 0.0749 | 0.108607 | 5281 |
| 40 | HOUSETYPE_MODE | 50.176091 | 0 block of flats dtype: object | NA | 3 |
| 41 | FLOORSMAX_MODE | 49.760822 | 0.1667 | 0.222315 | 25 |
| 42 | FLOORSMAX_MEDI | 49.760822 | 0.1667 | 0.225897 | 49 |
| 43 | FLOORSMAX_AVG | 49.760822 | 0.1667 | 0.226282 | 403 |
| 44 | YEARS_BEGINEXPLUATATION_MODE | 48.781019 | 0.9816 | 0.977065 | 221 |
| 45 | YEARS_BEGINEXPLUATATION_MEDI | 48.781019 | 0.9816 | 0.977752 | 245 |
| 46 | YEARS_BEGINEXPLUATATION_AVG | 48.781019 | 0.9816 | 0.977735 | 285 |
| 47 | TOTALAREA_MODE | 48.268517 | 0.0688 | 0.102547 | 5116 |
| 48 | EMERGENCYSTATE_MODE | 47.398304 | 0 No dtype: object | NA | 2 |
| 49 | OCCUPATION_TYPE | 31.345545 | 0 Laborers dtype: object | NA | 18 |
To slim down the data, we check each column for zero values; if 85% or more of the entries in a column are zero, we remove that column:
df_zero = pd.DataFrame()
columns = []
percentage = []
for col in df_train.columns:
    if col == 'TARGET':  # never drop the label column
        continue
    count = (df_train[col] == 0).sum()
    columns.append(col)
    percentage.append(count / len(df_train[col]))
df_zero['Column'] = columns
df_zero['Percentage'] = percentage
per = 85 / 100
df_zero = df_zero[df_zero['Percentage'] > per]
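As a side note, the loop above can be condensed into a single vectorized expression. A sketch on a toy frame (the column names and values here are made up):

```python
import pandas as pd

# Toy frame: one mostly-zero column, one mixed column
toy = pd.DataFrame({'TARGET': [0, 1, 0, 0],
                    'mostly_zero': [0, 0, 0, 5],
                    'mixed': [1, 2, 0, 4]})

# (df == 0).mean() gives the fraction of zeros per column directly
zero_frac = (toy.drop(columns='TARGET') == 0).mean()
sparse_cols = zero_frac[zero_frac > 0.5].index.tolist()  # columns to drop
```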
Printing all the columns in which at least 85% of the values are zero:
more_than_85 = df_zero
more_than_85
| | Column | Percentage |
|---|---|---|
| 26 | FLAG_EMAIL | 0.943280 |
| 33 | REG_REGION_NOT_LIVE_REGION | 0.984856 |
| 34 | REG_REGION_NOT_WORK_REGION | 0.949231 |
| 35 | LIVE_REGION_NOT_WORK_REGION | 0.959341 |
| 36 | REG_CITY_NOT_LIVE_CITY | 0.921827 |
| 91 | DEF_30_CNT_SOCIAL_CIRCLE | 0.882323 |
| 93 | DEF_60_CNT_SOCIAL_CIRCLE | 0.912881 |
| 95 | FLAG_DOCUMENT_2 | 0.999958 |
| 97 | FLAG_DOCUMENT_4 | 0.999919 |
| 98 | FLAG_DOCUMENT_5 | 0.984885 |
| 99 | FLAG_DOCUMENT_6 | 0.911945 |
| 100 | FLAG_DOCUMENT_7 | 0.999808 |
| 101 | FLAG_DOCUMENT_8 | 0.918624 |
| 102 | FLAG_DOCUMENT_9 | 0.996104 |
| 103 | FLAG_DOCUMENT_10 | 0.999977 |
| 104 | FLAG_DOCUMENT_11 | 0.996088 |
| 105 | FLAG_DOCUMENT_12 | 0.999993 |
| 106 | FLAG_DOCUMENT_13 | 0.996475 |
| 107 | FLAG_DOCUMENT_14 | 0.997064 |
| 108 | FLAG_DOCUMENT_15 | 0.998790 |
| 109 | FLAG_DOCUMENT_16 | 0.990072 |
| 110 | FLAG_DOCUMENT_17 | 0.999733 |
| 111 | FLAG_DOCUMENT_18 | 0.991870 |
| 112 | FLAG_DOCUMENT_19 | 0.999405 |
| 113 | FLAG_DOCUMENT_20 | 0.999493 |
| 114 | FLAG_DOCUMENT_21 | 0.999665 |
| 115 | AMT_REQ_CREDIT_BUREAU_HOUR | 0.859696 |
| 116 | AMT_REQ_CREDIT_BUREAU_DAY | 0.860142 |
Dropping all the columns in which at least 85% of the values are zero, then filtering out rows with placeholder category values ('Unknown', 'XNA', 'Maternity leave'):
df_train.drop(columns = more_than_85['Column'],inplace = True)
df_train =df_train[df_train['NAME_FAMILY_STATUS']!='Unknown']
df_train =df_train[df_train['CODE_GENDER']!='XNA']
df_train =df_train[df_train['NAME_INCOME_TYPE']!='Maternity leave']
Splitting the training data into numerical and categorical dataframes (keeping the TARGET column with the numerical data):
df_numerical = df_train.select_dtypes(exclude='object').copy()  # copy to avoid SettingWithCopyWarning
df_numerical['TARGET'] = df_train['TARGET']
df_categorical = df_train.select_dtypes(include='object').copy()
df_numerical.describe()
| | SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | ... | NONLIVINGAREA_MEDI | TOTALAREA_MODE | OBS_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_3 | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307500.000000 | 307500.000000 | 307500.000000 | 3.075000e+05 | 3.075000e+05 | 307488.000000 | 3.072240e+05 | 307500.000000 | 307500.000000 | 307500.000000 | ... | 137824.000000 | 159074.000000 | 306479.000000 | 306479.000000 | 307499.000000 | 307500.000000 | 265986.000000 | 265986.000000 | 265986.000000 | 265986.000000 |
| mean | 278181.087798 | 0.080725 | 0.417034 | 1.687971e+05 | 5.990259e+05 | 27108.477604 | 5.383943e+05 | 0.020868 | -16037.069246 | 63817.429333 | ... | 0.028237 | 0.102548 | 1.422202 | 1.405248 | -962.865681 | 0.710049 | 0.034363 | 0.267390 | 0.265476 | 1.899961 |
| std | 102789.822017 | 0.272413 | 0.722108 | 2.371263e+05 | 4.024936e+05 | 14493.600189 | 3.694459e+05 | 0.013831 | 4363.988872 | 141277.730537 | ... | 0.070168 | 0.107464 | 2.400947 | 2.379760 | 826.813694 | 0.453740 | 0.204687 | 0.915997 | 0.794062 | 1.869288 |
| min | 100002.000000 | 0.000000 | 0.000000 | 2.565000e+04 | 4.500000e+04 | 1615.500000 | 4.050000e+04 | 0.000290 | -25229.000000 | -17912.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -4292.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 189146.750000 | 0.000000 | 0.000000 | 1.125000e+05 | 2.700000e+05 | 16524.000000 | 2.385000e+05 | 0.010006 | -19682.000000 | -2760.000000 | ... | 0.000000 | 0.041200 | 0.000000 | 0.000000 | -1570.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 278202.500000 | 0.000000 | 0.000000 | 1.471500e+05 | 5.135310e+05 | 24903.000000 | 4.500000e+05 | 0.018850 | -15750.000000 | -1213.000000 | ... | 0.003100 | 0.068800 | 0.000000 | 0.000000 | -757.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 367143.250000 | 0.000000 | 1.000000 | 2.025000e+05 | 8.086500e+05 | 34596.000000 | 6.795000e+05 | 0.028663 | -12413.000000 | -289.000000 | ... | 0.026600 | 0.127600 | 2.000000 | 2.000000 | -274.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| max | 456255.000000 | 1.000000 | 19.000000 | 1.170000e+08 | 4.050000e+06 | 258025.500000 | 4.050000e+06 | 0.072508 | -7489.000000 | 365243.000000 | ... | 1.000000 | 1.000000 | 348.000000 | 344.000000 | 0.000000 | 1.000000 | 8.000000 | 27.000000 | 261.000000 | 25.000000 |
8 rows × 78 columns
df_categorical.describe()
| | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | OCCUPATION_TYPE | WEEKDAY_APPR_PROCESS_START | ORGANIZATION_TYPE | FONDKAPREMONT_MODE | HOUSETYPE_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307500 | 307500 | 307500 | 307500 | 306210 | 307500 | 307500 | 307500 | 307500 | 211112 | 307500 | 307500 | 97211 | 153208 | 151164 | 161750 |
| unique | 2 | 2 | 2 | 2 | 7 | 7 | 5 | 5 | 6 | 18 | 7 | 58 | 4 | 3 | 7 | 2 |
| top | Cash loans | F | N | Y | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | Laborers | TUESDAY | Business Entity Type 3 | reg oper account | block of flats | Panel | No |
| freq | 278230 | 202443 | 202916 | 213302 | 248520 | 158771 | 218387 | 196424 | 272859 | 55185 | 53898 | 67992 | 73827 | 150497 | 66037 | 159422 |
Checking the correlation of each numerical column with TARGET, keeping those whose absolute correlation exceeds 0.03 (both positive and negative):
correlation = df_numerical.corr()['TARGET'].sort_values(ascending = False).reset_index()
correlation.columns = ['col_name','Correlation']
after_correlation = correlation[abs(correlation['Correlation'])>0.03]
after_correlation
| | col_name | Correlation |
|---|---|---|
| 0 | TARGET | 1.000000 |
| 1 | DAYS_BIRTH | 0.078236 |
| 2 | REGION_RATING_CLIENT_W_CITY | 0.060875 |
| 3 | REGION_RATING_CLIENT | 0.058882 |
| 4 | DAYS_LAST_PHONE_CHANGE | 0.055228 |
| 5 | DAYS_ID_PUBLISH | 0.051455 |
| 6 | REG_CITY_NOT_WORK_CITY | 0.050981 |
| 7 | FLAG_EMP_PHONE | 0.045978 |
| 8 | FLAG_DOCUMENT_3 | 0.044371 |
| 9 | DAYS_REGISTRATION | 0.041950 |
| 10 | OWN_CAR_AGE | 0.037625 |
| 11 | LIVE_CITY_NOT_WORK_CITY | 0.032500 |
| 58 | AMT_CREDIT | -0.030390 |
| 59 | LIVINGAREA_MODE | -0.030688 |
| 60 | ELEVATORS_MODE | -0.032132 |
| 61 | TOTALAREA_MODE | -0.032600 |
| 62 | FLOORSMIN_MODE | -0.032700 |
| 63 | LIVINGAREA_MEDI | -0.032743 |
| 64 | LIVINGAREA_AVG | -0.033001 |
| 65 | FLOORSMIN_MEDI | -0.033397 |
| 66 | FLOORSMIN_AVG | -0.033616 |
| 67 | ELEVATORS_MEDI | -0.033864 |
| 68 | ELEVATORS_AVG | -0.034200 |
| 69 | REGION_POPULATION_RELATIVE | -0.037223 |
| 70 | AMT_GOODS_PRICE | -0.039671 |
| 71 | FLOORSMAX_MODE | -0.043228 |
| 72 | FLOORSMAX_MEDI | -0.043770 |
| 73 | FLOORSMAX_AVG | -0.044005 |
| 74 | DAYS_EMPLOYED | -0.044927 |
| 75 | EXT_SOURCE_1 | -0.155333 |
| 76 | EXT_SOURCE_2 | -0.160451 |
| 77 | EXT_SOURCE_3 | -0.178926 |
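The thresholding step above can be illustrated on a toy frame (the column names `up` and `noise` and their values are made up): features that move with TARGET in either direction survive the |correlation| > 0.03 filter, while uncorrelated ones are dropped:

```python
import pandas as pd

toy = pd.DataFrame({'TARGET': [0, 0, 1, 1],
                    'up': [1.0, 2.0, 3.0, 4.0],       # positively correlated with TARGET
                    'noise': [1.0, -1.0, 1.0, -1.0]})  # uncorrelated with TARGET

# Correlation of every feature with TARGET, then the absolute-value filter
corr = toy.corr()['TARGET'].drop('TARGET')
kept = corr[corr.abs() > 0.03].index.tolist()
```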
Checking which columns of the categorical dataframe contain missing values:
df_temp = missing(df_categorical,0)
df_temp
| | col_name | percentage_missing | median/Mode | mean | no_of_unique_values |
|---|---|---|---|---|---|
| 0 | FONDKAPREMONT_MODE | 68.386667 | 0 reg oper account dtype: object | NA | 4 |
| 1 | WALLSMATERIAL_MODE | 50.840976 | 0 Panel dtype: object | NA | 7 |
| 2 | HOUSETYPE_MODE | 50.176260 | 0 block of flats dtype: object | NA | 3 |
| 3 | EMERGENCYSTATE_MODE | 47.398374 | 0 No dtype: object | NA | 2 |
| 4 | OCCUPATION_TYPE | 31.345691 | 0 Laborers dtype: object | NA | 18 |
| 5 | NAME_TYPE_SUITE | 0.419512 | 0 Unaccompanied dtype: object | NA | 7 |
Dropping the categorical columns that have substantial missing values:
column_remove = ['FONDKAPREMONT_MODE','WALLSMATERIAL_MODE','HOUSETYPE_MODE','EMERGENCYSTATE_MODE','OCCUPATION_TYPE']
df_categorical = df_categorical.drop(columns=column_remove)  # avoid an inplace drop on a slice
In order to obtain a deeper understanding of the data, EDA involves generating summary statistics for the numerical data and creating various graphical representations. Data visualization presents textual or numerical data in a visual format, which makes the information easier to grasp: we remember pictures more easily than text. Python provides various libraries for data visualization, such as matplotlib, seaborn, and plotly; in this project we use matplotlib and seaborn to explore the data through various plots.
Numerical data refers to the data that is in the form of numbers, and not in any language or descriptive form.
Alias: also known as quantitative data, as it represents quantitative values on which arithmetic operations can be performed.
Type: Discrete data and Continuous data.
Analysis: Descriptive and inferential statistics.
Uses: used for statistical calculations, since arithmetic operations can be performed on it.
Structures: It is structured data and can be quickly organized and made sense of.
df_numerical.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 307500 entries, 0 to 307510
Data columns (total 78 columns):
 #   Column                        Non-Null Count   Dtype
---  ------                        --------------   -----
 0   SK_ID_CURR                    307500 non-null  int64
 1   TARGET                        307500 non-null  int64
 2   CNT_CHILDREN                  307500 non-null  int64
 3   AMT_INCOME_TOTAL              307500 non-null  float64
 4   AMT_CREDIT                    307500 non-null  float64
 5   AMT_ANNUITY                   307488 non-null  float64
 6   AMT_GOODS_PRICE               307224 non-null  float64
 7   REGION_POPULATION_RELATIVE    307500 non-null  float64
 8   DAYS_BIRTH                    307500 non-null  int64
 9   DAYS_EMPLOYED                 307500 non-null  int64
 10  DAYS_REGISTRATION             307500 non-null  float64
 11  DAYS_ID_PUBLISH               307500 non-null  int64
 12  OWN_CAR_AGE                   104579 non-null  float64
 13  FLAG_MOBIL                    307500 non-null  int64
 14  FLAG_EMP_PHONE                307500 non-null  int64
 15  FLAG_WORK_PHONE               307500 non-null  int64
 16  FLAG_CONT_MOBILE              307500 non-null  int64
 17  FLAG_PHONE                    307500 non-null  int64
 18  CNT_FAM_MEMBERS               307500 non-null  float64
 19  REGION_RATING_CLIENT          307500 non-null  int64
 20  REGION_RATING_CLIENT_W_CITY   307500 non-null  int64
 21  HOUR_APPR_PROCESS_START       307500 non-null  int64
 22  REG_CITY_NOT_WORK_CITY        307500 non-null  int64
 23  LIVE_CITY_NOT_WORK_CITY       307500 non-null  int64
 24  EXT_SOURCE_1                  134126 non-null  float64
 25  EXT_SOURCE_2                  306840 non-null  float64
 26  EXT_SOURCE_3                  246541 non-null  float64
 27  APARTMENTS_AVG                151444 non-null  float64
 28  BASEMENTAREA_AVG              127562 non-null  float64
 29  YEARS_BEGINEXPLUATATION_AVG   157498 non-null  float64
 30  YEARS_BUILD_AVG               103019 non-null  float64
 31  COMMONAREA_AVG                92644 non-null   float64
 32  ELEVATORS_AVG                 143614 non-null  float64
 33  ENTRANCES_AVG                 152677 non-null  float64
 34  FLOORSMAX_AVG                 154485 non-null  float64
 35  FLOORSMIN_AVG                 98866 non-null   float64
 36  LANDAREA_AVG                  124917 non-null  float64
 37  LIVINGAPARTMENTS_AVG          97309 non-null   float64
 38  LIVINGAREA_AVG                153155 non-null  float64
 39  NONLIVINGAPARTMENTS_AVG       93994 non-null   float64
 40  NONLIVINGAREA_AVG             137824 non-null  float64
 41  APARTMENTS_MODE               151444 non-null  float64
 42  BASEMENTAREA_MODE             127562 non-null  float64
 43  YEARS_BEGINEXPLUATATION_MODE  157498 non-null  float64
 44  YEARS_BUILD_MODE              103019 non-null  float64
 45  COMMONAREA_MODE               92644 non-null   float64
 46  ELEVATORS_MODE                143614 non-null  float64
 47  ENTRANCES_MODE                152677 non-null  float64
 48  FLOORSMAX_MODE                154485 non-null  float64
 49  FLOORSMIN_MODE                98866 non-null   float64
 50  LANDAREA_MODE                 124917 non-null  float64
 51  LIVINGAPARTMENTS_MODE         97309 non-null   float64
 52  LIVINGAREA_MODE               153155 non-null  float64
 53  NONLIVINGAPARTMENTS_MODE      93994 non-null   float64
 54  NONLIVINGAREA_MODE            137824 non-null  float64
 55  APARTMENTS_MEDI               151444 non-null  float64
 56  BASEMENTAREA_MEDI             127562 non-null  float64
 57  YEARS_BEGINEXPLUATATION_MEDI  157498 non-null  float64
 58  YEARS_BUILD_MEDI              103019 non-null  float64
 59  COMMONAREA_MEDI               92644 non-null   float64
 60  ELEVATORS_MEDI                143614 non-null  float64
 61  ENTRANCES_MEDI                152677 non-null  float64
 62  FLOORSMAX_MEDI                154485 non-null  float64
 63  FLOORSMIN_MEDI                98866 non-null   float64
 64  LANDAREA_MEDI                 124917 non-null  float64
 65  LIVINGAPARTMENTS_MEDI         97309 non-null   float64
 66  LIVINGAREA_MEDI               153155 non-null  float64
 67  NONLIVINGAPARTMENTS_MEDI      93994 non-null   float64
 68  NONLIVINGAREA_MEDI            137824 non-null  float64
 69  TOTALAREA_MODE                159074 non-null  float64
 70  OBS_30_CNT_SOCIAL_CIRCLE      306479 non-null  float64
 71  OBS_60_CNT_SOCIAL_CIRCLE      306479 non-null  float64
 72  DAYS_LAST_PHONE_CHANGE        307499 non-null  float64
 73  FLAG_DOCUMENT_3               307500 non-null  int64
 74  AMT_REQ_CREDIT_BUREAU_WEEK    265986 non-null  float64
 75  AMT_REQ_CREDIT_BUREAU_MON     265986 non-null  float64
 76  AMT_REQ_CREDIT_BUREAU_QRT     265986 non-null  float64
 77  AMT_REQ_CREDIT_BUREAU_YEAR    265986 non-null  float64
dtypes: float64(61), int64(17)
memory usage: 185.3 MB
Here we plot some of the columns that are positively correlated with the target variable and analyze the trends:
plt.figure(figsize=[10,8])
sb.pointplot(x='TARGET',y='REGION_RATING_CLIENT_W_CITY',data=df_numerical,color= 'GREEN')
plt.title("REGION_RATING_CLIENT_W_CITY vs TARGET", fontweight = 'bold', fontsize = 20)
plt.figure(figsize=[10,8])
sb.pointplot(x='TARGET',y='DAYS_BIRTH',data=df_numerical,color= 'GREEN')
plt.title("DAYS_BIRTH vs TARGET", fontweight = 'bold', fontsize = 20)
plt.figure(figsize=[10,8])
sb.pointplot(x='TARGET',y='REG_CITY_NOT_WORK_CITY',data=df_numerical,color= 'GREEN')
plt.title("REG_CITY_NOT_WORK_CITY vs TARGET", fontweight = 'bold', fontsize = 20)
Here we plot some of the columns that are negatively correlated with the target variable and analyze the trends:
plt.figure(figsize=[10,8])
sb.pointplot(x='TARGET',y='EXT_SOURCE_3',data=df_numerical,color= 'RED')
plt.title("EXT_SOURCE_3 vs TARGET", fontweight = 'bold', fontsize = 20)
plt.figure(figsize=[10,8])
sb.pointplot(x='TARGET',y='DAYS_EMPLOYED',data=df_numerical,color= 'RED')
plt.title("DAYS_EMPLOYED vs TARGET", fontweight = 'bold', fontsize = 20)
plt.figure(figsize=[10,8])
sb.pointplot(x='TARGET',y='AMT_CREDIT',data=df_numerical,color= 'RED')
plt.title("AMT_CREDIT vs TARGET", fontweight = 'bold', fontsize = 20)
Plotting a heatmap to analyze correlation in the application train dataset:
sb.heatmap(df_train.corr());
Plotting heatmap to see correlation in application test dataset
sb.heatmap(df_test.corr());
Categorical data refers to a data type that can be stored and identified based on the names or labels given to it.
Alias: Also known as qualitative data, since it qualifies data before classifying it.
Types: Nominal data and ordinal data.
Analysis: Typically summarized with the mode (the median additionally applies to ordinal data).
Uses: Used when a study requires respondents' personal information, opinions, and experiences; commonly used in business research.
Structure: Usually treated as unstructured or semi-structured data.
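Following this distinction, the numerical/categorical split used in this notebook can be reproduced with pandas `select_dtypes`; a minimal sketch on a hypothetical mini-frame (column names borrowed from the application data, values made up):

```python
import pandas as pd

# Hypothetical mini-frame standing in for application_train
df = pd.DataFrame({
    'AMT_CREDIT': [406597.5, 1293502.5],
    'CNT_CHILDREN': [0, 1],
    'NAME_CONTRACT_TYPE': ['Cash loans', 'Revolving loans'],
})

# Numerical columns have int/float dtypes; categorical columns load as object
df_numerical = df.select_dtypes(include=['number'])
df_categorical = df.select_dtypes(include=['object'])

print(list(df_numerical.columns))    # ['AMT_CREDIT', 'CNT_CHILDREN']
print(list(df_categorical.columns))  # ['NAME_CONTRACT_TYPE']
```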
df_categorical.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 307500 entries, 0 to 307510 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 NAME_CONTRACT_TYPE 307500 non-null object 1 CODE_GENDER 307500 non-null object 2 FLAG_OWN_CAR 307500 non-null object 3 FLAG_OWN_REALTY 307500 non-null object 4 NAME_TYPE_SUITE 306210 non-null object 5 NAME_INCOME_TYPE 307500 non-null object 6 NAME_EDUCATION_TYPE 307500 non-null object 7 NAME_FAMILY_STATUS 307500 non-null object 8 NAME_HOUSING_TYPE 307500 non-null object 9 WEEKDAY_APPR_PROCESS_START 307500 non-null object 10 ORGANIZATION_TYPE 307500 non-null object dtypes: object(11) memory usage: 36.2+ MB
eda_cat1 = df_categorical['FLAG_OWN_REALTY'].value_counts()
print(eda_cat1)
plt.figure(figsize=[10,8])
sb.countplot(x='FLAG_OWN_REALTY', data=df_categorical, palette='Reds')
plt.title("Percentage of loan in accordance to REALTY", fontweight = 'bold', fontsize = 14)
Y 213302 N 94198 Name: FLAG_OWN_REALTY, dtype: int64
From this graph we can see how many borrowers own real estate and how many do not.
eda_cat2 = df_categorical['NAME_INCOME_TYPE'].value_counts()
print(eda_cat2)
plt.figure(figsize=[15,10])
sb.countplot(x='NAME_INCOME_TYPE', data=df_categorical, palette='Blues')
plt.title("Percentage of loan in accordance to Income Type", fontweight = 'bold', fontsize = 20)
Working 158771 Commercial associate 71614 Pensioner 55362 State servant 21703 Unemployed 22 Student 18 Businessman 10 Name: NAME_INCOME_TYPE, dtype: int64
From the above graph we can see that the largest group of borrowers comes from the working class.
eda_cat3 = df_categorical['NAME_CONTRACT_TYPE'].value_counts()
print(eda_cat3)
plt.figure(figsize=[10,8])
sb.countplot(x='NAME_CONTRACT_TYPE', data=df_categorical, palette='Greens')
plt.title("Percentage of loan in accordance to Contract Type", fontweight = 'bold', fontsize = 16)
Cash loans 278230 Revolving loans 29270 Name: NAME_CONTRACT_TYPE, dtype: int64
From the above graph we can see the split between the two loan types: cash loans far outnumber revolving loans.
plt.figure(figsize=[10,8])
sb.countplot(x='NAME_FAMILY_STATUS', data=df_categorical, palette='Purples')
plt.title('Percentage of Borrower Family Status', fontweight = 'bold', fontsize = 16)
plt.show()
From the above graph we can see that married people tend to borrow more.
plt.figure(figsize=[10,8])
#plt.pie(df_categorical['WEEKDAY_APPR_PROCESS_START'].value_counts(),explode = (0.05,0.05,0.05,0.05,0.05,0.05,0.05),labels = df_categorical['WEEKDAY_APPR_PROCESS_START'].value_counts().index,autopct='%1.1f%%')
sb.countplot(x='WEEKDAY_APPR_PROCESS_START', data=df_categorical, palette='Oranges')
plt.title('Day-wise depiction of Percentage of loan approval process', fontweight = 'bold', fontsize = 16)
plt.show()
Now comes the fun part. In a statistical sense, models are general rules. Think of machine learning models as tools in your toolbox: you have access to many algorithms and use them to accomplish different goals. The better the features you use, the better your predictive power will be. After cleaning your data and identifying the most important features, the model becomes a much stronger tool for predictive decision making.
Collectively, the linear sequence of steps required to prepare the data, tune the model, and transform the predictions is called the modeling pipeline. Modern machine learning libraries like the scikit-learn Python library allow this sequence of steps to be defined and used correctly (without data leakage) and consistently (during evaluation and prediction). A pipeline is a linear sequence of data preparation options, modeling operations, and prediction transform operations.
The modeling pipeline is an important tool for machine learning practitioners. Nevertheless, there are important implications that must be considered when using them. The main confusion for beginners when using pipelines comes in understanding what the pipeline has learned or the specific configuration discovered by the pipeline.
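To see what a fitted pipeline has learned, each stage can be inspected through its `named_steps` attribute; a small sketch on synthetic data (not the HCDR features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
pipe = Pipeline([('scaler', StandardScaler()),
                 ('classifier', LogisticRegression())])
pipe.fit(X, y)

# named_steps exposes the fitted configuration of each stage
print(pipe.named_steps['scaler'].mean_.shape)      # per-feature means: (5,)
print(pipe.named_steps['classifier'].coef_.shape)  # learned weights: (1, 5)
```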
Therefore, for this project we are going to compare several modeling pipelines for home credit default risk prediction:
We will choose the model that gives the best accuracy for the home credit default risk prediction.
We will be using the following pipeline for this project:
Importing all the necessary python libraries for the different pipelines we are going to use:
import warnings
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer, make_column_transformer
warnings.filterwarnings("ignore", category=DeprecationWarning)
Selecting only the columns that we have finally decided for the numerical and the categorical part:
df_cols = list(df_numerical.columns)+ list(df_categorical.columns)
df_i = df_train[df_cols]
In the numerical pipeline (numerical_pipeline) we impute missing values with the column mean. In the categorical pipeline (categorical_pipeline) we impute missing values with the most frequent value and then apply one-hot encoding to handle the categorical data.
We then merge the numerical and categorical pipelines using a ColumnTransformer.
numerical_pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean'))])
categorical_pipeline = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))])
data_pipeline = ColumnTransformer([
("num_pipeline", numerical_pipeline, df_numerical.columns),
("cat_pipeline", categorical_pipeline, df_categorical.columns)], n_jobs = -1)
df_transformed = data_pipeline.fit_transform(df_i)
final_column_names = list(df_numerical.columns) + \
list(data_pipeline.transformers_[1][1].named_steps["ohe"].get_feature_names(df_categorical.columns))
Saving transformed dataset used for the model training:
df_final = pd.DataFrame(df_transformed, columns=final_column_names)
df_temp = df_final
df_final
| SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | ... | ORGANIZATION_TYPE_Trade: type 4 | ORGANIZATION_TYPE_Trade: type 5 | ORGANIZATION_TYPE_Trade: type 6 | ORGANIZATION_TYPE_Trade: type 7 | ORGANIZATION_TYPE_Transport: type 1 | ORGANIZATION_TYPE_Transport: type 2 | ORGANIZATION_TYPE_Transport: type 3 | ORGANIZATION_TYPE_Transport: type 4 | ORGANIZATION_TYPE_University | ORGANIZATION_TYPE_XNA | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002.0 | 1.0 | 0.0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | 0.018801 | -9461.0 | -637.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100003.0 | 0.0 | 0.0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | 0.003541 | -16765.0 | -1188.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004.0 | 0.0 | 0.0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | 0.010032 | -19046.0 | -225.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006.0 | 0.0 | 0.0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | 0.008019 | -19005.0 | -3039.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 100007.0 | 0.0 | 0.0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | 0.028663 | -19932.0 | -3038.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307495 | 456251.0 | 0.0 | 0.0 | 157500.0 | 254700.0 | 27558.0 | 225000.0 | 0.032561 | -9327.0 | -236.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 307496 | 456252.0 | 0.0 | 0.0 | 72000.0 | 269550.0 | 12001.5 | 225000.0 | 0.025164 | -20775.0 | 365243.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 307497 | 456253.0 | 0.0 | 0.0 | 153000.0 | 677664.0 | 29979.0 | 585000.0 | 0.005002 | -14966.0 | -7921.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 307498 | 456254.0 | 1.0 | 0.0 | 171000.0 | 370107.0 | 20205.0 | 319500.0 | 0.005313 | -11961.0 | -4786.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 307499 | 456255.0 | 0.0 | 0.0 | 157500.0 | 675000.0 | 49117.5 | 675000.0 | 0.046220 | -16856.0 | -1262.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
307500 rows × 181 columns
We are merging the different dataframes together according to the data diagram shown in the data description section above using primary keys:
previous_application = pd.merge(left=previous_application, right=pos_cash_balance, how='left', left_on='SK_ID_CURR', right_on='SK_ID_CURR')
previous_application = pd.merge(left=previous_application, right=install_payment, how='left', left_on='SK_ID_CURR', right_on='SK_ID_CURR')
previous_application = pd.merge(left=previous_application, right=credit_card_balance, how='left', left_on='SK_ID_CURR', right_on='SK_ID_CURR')
df_final = pd.merge(left=df_final, right=previous_application, how='left', left_on='SK_ID_CURR', right_on='SK_ID_CURR')
bureau = pd.merge(left=bureau, right=bureau_balance, how='left', left_on='SK_ID_BUREAU', right_on='SK_ID_BUREAU')
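One caveat with these left joins: the child tables hold many rows per `SK_ID_CURR`, so a raw merge can multiply the rows of the left table. Aggregating the child table first keeps one row per applicant — a toy sketch with hypothetical two-column frames (not the real tables):

```python
import pandas as pd

left = pd.DataFrame({'SK_ID_CURR': [1, 2], 'TARGET': [0, 1]})
child = pd.DataFrame({'SK_ID_CURR': [1, 1, 1, 2], 'SK_DPD': [0, 3, 0, 5]})

# A raw left join duplicates left-table rows (4 rows instead of 2)
raw = pd.merge(left, child, how='left', on='SK_ID_CURR')
print(len(raw))  # 4

# Aggregating the child table first preserves one row per applicant
agg = child.groupby('SK_ID_CURR', as_index=False).agg(SK_DPD_MAX=('SK_DPD', 'max'))
safe = pd.merge(left, agg, how='left', on='SK_ID_CURR')
print(len(safe))  # 2
```

This is the same reason the bureau table is aggregated with `groupby(...).median()` before its merge below.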
df_final
| SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT_x | AMT_ANNUITY_x | AMT_GOODS_PRICE_x | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | ... | AMT_RECEIVABLE_PRINCIPAL | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | SK_DPD_y | SK_DPD_DEF_y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002.0 | 1.0 | 0.0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | 0.018801 | -9461.0 | -637.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 100003.0 | 0.0 | 0.0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | 0.003541 | -16765.0 | -1188.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 100004.0 | 0.0 | 0.0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | 0.010032 | -19046.0 | -225.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 100006.0 | 0.0 | 0.0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | 0.008019 | -19005.0 | -3039.0 | ... | 0.0 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | 0.0 | 0.0 | 0.0 |
| 4 | 100007.0 | 0.0 | 0.0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | 0.028663 | -19932.0 | -3038.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307495 | 456251.0 | 0.0 | 0.0 | 157500.0 | 254700.0 | 27558.0 | 225000.0 | 0.032561 | -9327.0 | -236.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 307496 | 456252.0 | 0.0 | 0.0 | 72000.0 | 269550.0 | 12001.5 | 225000.0 | 0.025164 | -20775.0 | 365243.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 307497 | 456253.0 | 0.0 | 0.0 | 153000.0 | 677664.0 | 29979.0 | 585000.0 | 0.005002 | -14966.0 | -7921.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 307498 | 456254.0 | 1.0 | 0.0 | 171000.0 | 370107.0 | 20205.0 | 319500.0 | 0.005313 | -11961.0 | -4786.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 307499 | 456255.0 | 0.0 | 0.0 | 157500.0 | 675000.0 | 49117.5 | 675000.0 | 0.046220 | -16856.0 | -1262.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
307500 rows × 235 columns
We are merging training dataset with the bureau after cleaning the bureau dataset:
bureau = bureau.drop_duplicates()
bureau.shape
bureau = bureau.groupby(['SK_ID_CURR','SK_ID_BUREAU']).min()
clean_bureau = bureau[['DAYS_CREDIT','DAYS_ENDDATE_FACT','AMT_CREDIT_SUM','DAYS_CREDIT_UPDATE','MONTHS_BALANCE']]
clean_bureau = clean_bureau.reset_index()
clean_bureau = clean_bureau.groupby('SK_ID_CURR').median()
clean_bureau = clean_bureau.reset_index()
df_final = pd.merge(left=df_final, right=clean_bureau, how='left', left_on='SK_ID_CURR', right_on='SK_ID_CURR')
df_final.head()
| SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT_x | AMT_ANNUITY_x | AMT_GOODS_PRICE_x | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | ... | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | SK_DPD_y | SK_DPD_DEF_y | SK_ID_BUREAU | DAYS_CREDIT | DAYS_ENDDATE_FACT | AMT_CREDIT_SUM | DAYS_CREDIT_UPDATE | MONTHS_BALANCE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002.0 | 1.0 | 0.0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | 0.018801 | -9461.0 | -637.0 | ... | NaN | NaN | NaN | NaN | 6158905.5 | -1042.5 | -939.0 | 54130.50 | -402.5 | -34.0 |
| 1 | 100003.0 | 0.0 | 0.0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | 0.003541 | -16765.0 | -1188.0 | ... | NaN | NaN | NaN | NaN | 5885878.5 | -1205.5 | -621.0 | 92576.25 | -545.0 | NaN |
| 2 | 100004.0 | 0.0 | 0.0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | 0.010032 | -19046.0 | -225.0 | ... | NaN | NaN | NaN | NaN | 6829133.5 | -867.0 | -532.5 | 94518.90 | -532.0 | NaN |
| 3 | 100006.0 | 0.0 | 0.0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | 0.008019 | -19005.0 | -3039.0 | ... | NaN | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007.0 | 0.0 | 0.0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | 0.028663 | -19932.0 | -3038.0 | ... | NaN | NaN | NaN | NaN | 5987200.0 | -1149.0 | -783.0 | 146250.00 | -783.0 | NaN |
5 rows × 241 columns
clean_bureau
| SK_ID_CURR | SK_ID_BUREAU | DAYS_CREDIT | DAYS_ENDDATE_FACT | AMT_CREDIT_SUM | DAYS_CREDIT_UPDATE | MONTHS_BALANCE | |
|---|---|---|---|---|---|---|---|
| 0 | 100001 | 5896633.0 | -857.0 | -715.0 | 168345.00 | -155.0 | -28.0 |
| 1 | 100002 | 6158905.5 | -1042.5 | -939.0 | 54130.50 | -402.5 | -34.0 |
| 2 | 100003 | 5885878.5 | -1205.5 | -621.0 | 92576.25 | -545.0 | NaN |
| 3 | 100004 | 6829133.5 | -867.0 | -532.5 | 94518.90 | -532.0 | NaN |
| 4 | 100005 | 6735201.0 | -137.0 | -123.0 | 58500.00 | -31.0 | -4.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 305806 | 456249 | 5371700.0 | -1680.0 | -1279.0 | 248692.50 | -909.0 | NaN |
| 305807 | 456250 | 6817237.0 | -824.0 | -760.0 | 483349.50 | -31.0 | -27.0 |
| 305808 | 456253 | 6098498.5 | -919.0 | -794.0 | 675000.00 | -153.5 | -30.0 |
| 305809 | 456254 | 6669849.0 | -1104.0 | -859.0 | 45000.00 | -401.0 | -36.0 |
| 305810 | 456255 | 5126332.0 | -1020.0 | -869.5 | 436032.00 | -700.0 | -33.0 |
305811 rows × 7 columns
Defining a function to find target correlation with the other features:
def corr_target(df,cor):
correlation = df.corr()['TARGET'].sort_values(ascending=False).reset_index()
correlation.columns = ['col_name','Correlation']
after_correlation = correlation[abs(correlation['Correlation'])>cor]
return after_correlation
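A quick sanity check of `corr_target` on synthetic data (the function is repeated here so the snippet runs standalone; `strong` and `noise` are made-up columns):

```python
import numpy as np
import pandas as pd

def corr_target(df, cor):
    correlation = df.corr()['TARGET'].sort_values(ascending=False).reset_index()
    correlation.columns = ['col_name', 'Correlation']
    return correlation[abs(correlation['Correlation']) > cor]

rng = np.random.default_rng(0)
target = rng.integers(0, 2, 500)
toy = pd.DataFrame({
    'TARGET': target,
    'strong': target + rng.normal(0, 0.1, 500),  # highly correlated feature
    'noise': rng.normal(0, 1, 500),              # uncorrelated feature
})

# Only TARGET itself and the correlated feature survive a 0.5 threshold
result = corr_target(toy, 0.5)
print(result['col_name'].tolist())
```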
df_final_features = df_final
df_final_features.head()
| SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT_x | AMT_ANNUITY_x | AMT_GOODS_PRICE_x | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | ... | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | SK_DPD_y | SK_DPD_DEF_y | SK_ID_BUREAU | DAYS_CREDIT | DAYS_ENDDATE_FACT | AMT_CREDIT_SUM | DAYS_CREDIT_UPDATE | MONTHS_BALANCE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002.0 | 1.0 | 0.0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | 0.018801 | -9461.0 | -637.0 | ... | NaN | NaN | NaN | NaN | 6158905.5 | -1042.5 | -939.0 | 54130.50 | -402.5 | -34.0 |
| 1 | 100003.0 | 0.0 | 0.0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | 0.003541 | -16765.0 | -1188.0 | ... | NaN | NaN | NaN | NaN | 5885878.5 | -1205.5 | -621.0 | 92576.25 | -545.0 | NaN |
| 2 | 100004.0 | 0.0 | 0.0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | 0.010032 | -19046.0 | -225.0 | ... | NaN | NaN | NaN | NaN | 6829133.5 | -867.0 | -532.5 | 94518.90 | -532.0 | NaN |
| 3 | 100006.0 | 0.0 | 0.0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | 0.008019 | -19005.0 | -3039.0 | ... | NaN | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007.0 | 0.0 | 0.0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | 0.028663 | -19932.0 | -3038.0 | ... | NaN | NaN | NaN | NaN | 5987200.0 | -1149.0 | -783.0 | 146250.00 | -783.0 | NaN |
5 rows × 241 columns
We additionally add a few hand-crafted features to the training dataset, as follows:
df_final_features['FEATURE1']= df_final_features['AMT_TOTAL_RECEIVABLE']/(df_final_features['AMT_BALANCE']+1)
df_final_features['FEATURE2'] = df_final_features['AMT_TOTAL_RECEIVABLE']/(df_final_features['AMT_RECIVABLE']+1)
df_final_features['FEATURE3'] = df_final_features['AMT_TOTAL_RECEIVABLE']/(df_final_features['AMT_RECEIVABLE_PRINCIPAL']+1)
df_final_features['FEATURE4']=df_final_features['AMT_CREDIT_x'] / (df_final_features['AMT_INCOME_TOTAL']+1)
df_final_features['FEATURE5']=df_final_features['AMT_ANNUITY_x'] / (df_final_features['AMT_INCOME_TOTAL']+1)
df_final_features['FEATURE6']= df_final_features['AMT_ANNUITY_x'] / (df_final_features['AMT_CREDIT_x'] +1)
df_final_features['FEATURE7']=(df_final_features['EXT_SOURCE_1']*df_final_features['EXT_SOURCE_2']*df_final_features['EXT_SOURCE_3'])
df_final_features
| SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT_x | AMT_ANNUITY_x | AMT_GOODS_PRICE_x | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | ... | AMT_CREDIT_SUM | DAYS_CREDIT_UPDATE | MONTHS_BALANCE | FEATURE1 | FEATURE2 | FEATURE3 | FEATURE4 | FEATURE5 | FEATURE6 | FEATURE7 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002.0 | 1.0 | 0.0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | 0.018801 | -9461.0 | -637.0 | ... | 54130.50 | -402.5 | -34.0 | NaN | NaN | NaN | 2.007879 | 0.121977 | 0.060749 | 0.003043 |
| 1 | 100003.0 | 0.0 | 0.0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | 0.003541 | -16765.0 | -1188.0 | ... | 92576.25 | -545.0 | NaN | NaN | NaN | NaN | 4.790732 | 0.132216 | 0.027598 | 0.098945 |
| 2 | 100004.0 | 0.0 | 0.0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | 0.010032 | -19046.0 | -225.0 | ... | 94518.90 | -532.0 | NaN | NaN | NaN | NaN | 1.999970 | 0.099999 | 0.050000 | 0.203649 |
| 3 | 100006.0 | 0.0 | 0.0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | 0.008019 | -19005.0 | -3039.0 | ... | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 2.316150 | 0.219898 | 0.094941 | 0.166847 |
| 4 | 100007.0 | 0.0 | 0.0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | 0.028663 | -19932.0 | -3038.0 | ... | 146250.00 | -783.0 | NaN | NaN | NaN | NaN | 4.222187 | 0.179961 | 0.042623 | 0.082786 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307495 | 456251.0 | 0.0 | 0.0 | 157500.0 | 254700.0 | 27558.0 | 225000.0 | 0.032561 | -9327.0 | -236.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 1.617133 | 0.174970 | 0.108197 | 0.050690 |
| 307496 | 456252.0 | 0.0 | 0.0 | 72000.0 | 269550.0 | 12001.5 | 225000.0 | 0.025164 | -20775.0 | 365243.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 3.743698 | 0.166685 | 0.044524 | 0.029753 |
| 307497 | 456253.0 | 0.0 | 0.0 | 153000.0 | 677664.0 | 29979.0 | 585000.0 | 0.005002 | -14966.0 | -7921.0 | ... | 675000.00 | -153.5 | -30.0 | NaN | NaN | NaN | 4.429148 | 0.195940 | 0.044239 | 0.087235 |
| 307498 | 456254.0 | 1.0 | 0.0 | 171000.0 | 370107.0 | 20205.0 | 319500.0 | 0.005313 | -11961.0 | -4786.0 | ... | 45000.00 | -401.0 | -36.0 | NaN | NaN | NaN | 2.164356 | 0.118157 | 0.054592 | 0.170659 |
| 307499 | 456255.0 | 0.0 | 0.0 | 157500.0 | 675000.0 | 49117.5 | 675000.0 | 0.046220 | -16856.0 | -1262.0 | ... | 436032.00 | -700.0 | -33.0 | NaN | NaN | NaN | 4.285687 | 0.311855 | 0.072767 | 0.059287 |
307500 rows × 248 columns
Finding the correlation between the newly made features and the target feature:
feature_correlations = df_final_features[['FEATURE1','FEATURE2','FEATURE3','FEATURE4','FEATURE5','FEATURE6','FEATURE7','TARGET']].corr()
feature_correlations
| FEATURE1 | FEATURE2 | FEATURE3 | FEATURE4 | FEATURE5 | FEATURE6 | FEATURE7 | TARGET | |
|---|---|---|---|---|---|---|---|---|
| FEATURE1 | 1.000000 | 0.048508 | 0.113956 | -0.000901 | -0.002146 | -0.003188 | 0.001960 | -0.000746 |
| FEATURE2 | 0.048508 | 1.000000 | 0.053074 | -0.049499 | -0.059499 | -0.013560 | -0.107162 | 0.097484 |
| FEATURE3 | 0.113956 | 0.053074 | 1.000000 | 0.002338 | -0.000230 | -0.005639 | -0.003305 | 0.000280 |
| FEATURE4 | -0.000901 | -0.049499 | 0.002338 | 1.000000 | 0.788086 | -0.522153 | 0.072455 | -0.007821 |
| FEATURE5 | -0.002146 | -0.059499 | -0.000230 | 0.788086 | 1.000000 | -0.029672 | 0.040208 | 0.014210 |
| FEATURE6 | -0.003188 | -0.013560 | -0.005639 | -0.522153 | -0.029672 | 1.000000 | -0.053857 | 0.012722 |
| FEATURE7 | 0.001960 | -0.107162 | -0.003305 | 0.072455 | 0.040208 | -0.053857 | 1.000000 | -0.189587 |
| TARGET | -0.000746 | 0.097484 | 0.000280 | -0.007821 | 0.014210 | 0.012722 | -0.189587 | 1.000000 |
sb.heatmap(feature_correlations);  # feature_correlations is already a correlation matrix
df_final_features = df_final_features.rename(columns = {'AMT_CREDIT_y':'AMT_CREDIT'})
df_final_features = df_final_features.apply(lambda x: x.fillna(x.median()),axis=0)
df_final_features.shape
(307500, 248)
We are shortlisting all the features with an absolute correlation greater than 5% with respect to the target:
corr_greater_than_5= corr_target(df_final_features,0.05)
corr_greater_than_5.shape
(18, 2)
selected_features = corr_greater_than_5['col_name']
new_features = df_final_features[selected_features]
models_results = []
Target = df_final_features['TARGET']
new_features
new_features.to_csv('final_application_train.csv')
Using the shortlisted features for the training dataset and splitting the whole dataset into training and validation sets:
y=Target.values
X =new_features.drop(columns =[ 'TARGET']).values
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.20, random_state=98)
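Given the class imbalance in the HCDR target (roughly 8% defaults), passing `stratify=y` keeps the default rate consistent across the two splits; a sketch with synthetic labels (an option, not what the split above uses):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 80 + [0] * 920)  # ~8% positives, mimicking HCDR imbalance

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.20, random_state=98, stratify=y)

# Both splits keep roughly the same positive rate (~0.08)
print(round(y_tr.mean(), 2), round(y_va.mean(), 2))
```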
We have performed the following experiments with different groups of newly created features. With hyperparameter tuning and datasets including these features, we calculated the accuracies for the different models, and then selected the best-performing group of features from these experiments.
Link: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
When creating a machine learning model, you are presented with design choices for the model architecture. Often we don't immediately know the optimal architecture for a given problem, so we want to explore a range of possibilities. In true machine learning fashion, we ideally ask the machine to perform this exploration and select the optimal architecture automatically. Parameters that define the model architecture are referred to as hyperparameters, and this process of searching for the ideal architecture is referred to as hyperparameter tuning.
We can use grid search to tune the hyperparameters. Grid search tries every combination of the specified hyperparameter values, calculates the performance for each combination, and selects the best values. This makes the process time-consuming and expensive, depending on the number of hyperparameters involved. GridSearchCV combines grid search with cross-validation, which is applied during training: before training we split the data into train and test sets, and cross-validation further divides the training data into train and validation folds.
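A minimal standalone sketch of GridSearchCV on synthetic data (not our HCDR pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)

# Three candidate values of C, each scored with 5-fold cross-validation;
# refit=True retrains the best candidate on the full training data
grid = GridSearchCV(LogisticRegression(max_iter=500),
                    param_grid={'C': [0.1, 1, 10]}, cv=5, refit=True)
grid.fit(X, y)

print(len(grid.cv_results_['mean_test_score']))  # one mean CV score per candidate
print(grid.best_params_)
```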
Library: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
Naive Bayes is a simple but effective probabilistic classification model in machine learning based on Bayes' theorem.
Bayes' theorem gives the conditional probability of an event A happening given that another event B has already happened:
P(A|B) = P(B|A) * P(A) / P(B)
where P(A|B) is the probability of event A given that B has already happened, P(B|A) is the probability of event B given that A has already happened, and P(A) and P(B) are the independent probabilities of A and B.
Applied to classification, with X = x1, x2, x3, ..., xN the list of independent predictors and y the class label, the theorem yields the model:
P(y|X) = P(X|y) * P(y) / P(X)
Under the naive independence assumption, this expands to:
P(y|x1, ..., xN) = P(x1|y) * P(x2|y) * ... * P(xN|y) * P(y) / (P(x1) * P(x2) * ... * P(xN))
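As a small numeric illustration of Bayes' theorem, with made-up probabilities (an 8% prior default rate and a hypothetical binary risk flag):

```python
# Hypothetical numbers: P(default) = 0.08, P(flag|default) = 0.5, P(flag|repaid) = 0.1
p_default = 0.08
p_flag_given_default = 0.5
p_flag_given_repaid = 0.1

# Total probability of seeing the flag
p_flag = (p_flag_given_default * p_default
          + p_flag_given_repaid * (1 - p_default))

# Bayes' theorem: posterior probability of default given the flag
p_default_given_flag = p_flag_given_default * p_default / p_flag
print(round(p_default_given_flag, 3))  # 0.303
```

Even a flag five times more common among defaulters lifts the posterior only to about 30%, because the prior default rate is low.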
We are not considering the Naive Bayes model from phase 2 onwards because it gave the lowest accuracy during phase 1.
pipe_naive_bayes = Pipeline([
('scaler', StandardScaler()),
('classifier', GaussianNB())])
pipe_naive_bayes.fit(X_train, y_train)
Pipeline(steps=[('scaler', StandardScaler()), ('classifier', GaussianNB())])
Library: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
Logistic regression is a statistical method for predicting binary classes. The outcome or target variable is dichotomous in nature. Dichotomous means there are only two possible classes. For example, it can be used for cancer detection problems. It computes the probability of an event occurrence.
It is a special case of linear regression where the target variable is categorical in nature. It uses a log of odds as the dependent variable. Logistic Regression predicts the probability of occurrence of a binary event utilizing a logit function.
Linear Regression Equation:
where y is the dependent variable and x1, x2, ..., xn are the explanatory variables.
Sigmoid Function:
Apply Sigmoid function on linear regression:
Properties of Logistic Regression:
The dependent variable in logistic regression follows a Bernoulli distribution.
Estimation is done through maximum likelihood.
There is no R-squared; model fitness is assessed through concordance and KS statistics.
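The logit link can be illustrated directly; a small sketch of the sigmoid squashing the linear score into a probability:

```python
import numpy as np

def sigmoid(z):
    """Map a linear score z = b0 + b1*x1 + ... into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Large negative scores map near 0, zero maps to 0.5, large positive near 1
z = np.array([-4.0, 0.0, 4.0])
print(np.round(sigmoid(z), 3))  # [0.018 0.5   0.982]
```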
pipe = Pipeline([
('scaler', StandardScaler()),
('classifier', GridSearchCV(LogisticRegression(solver='newton-cg', max_iter=1500),
param_grid={'C': [ 0.1,1,5,10.]},
cv=5,
refit=True))
])
pipe.fit(X_train, y_train)
Pipeline(steps=[('scaler', StandardScaler()),
('classifier',
GridSearchCV(cv=5,
estimator=LogisticRegression(max_iter=1500,
solver='newton-cg'),
param_grid={'C': [0.1, 1, 5, 10.0]}))])
pipe.named_steps['classifier'].best_params_
{'C': 0.1}
Library: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
Random forest is a Supervised Machine Learning Algorithm that is used widely in Classification and Regression problems. It builds decision trees on different samples and takes their majority vote for classification and average in case of regression.
Random forest works on the bagging principle. Bagging, also known as bootstrap aggregation, is the ensemble technique used by random forest. Each model is built from a random sample drawn from the original data with replacement (a bootstrap sample); this row sampling with replacement is called bootstrapping. Each model is then trained independently and produces its own result. The final output combines the results of all models, by majority vote for classification; this combining step is known as aggregation.
Steps involved in random forest algorithm:
Step 1: In Random forest n number of random records are taken from the data set having k number of records.
Step 2: Individual decision trees are constructed for each sample.
Step 3: Each decision tree will generate an output.
Step 4: Final output is considered based on Majority Voting or Averaging for Classification and regression respectively.
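The four steps above can be sketched directly (a toy synthetic dataset stands in for the HCDR features; everything below is illustrative, not the library's internal implementation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    # Step 1: draw a bootstrap sample (row sampling with replacement)
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: fit an individual decision tree on that sample
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Step 3: each tree generates an output; Step 4: the majority vote decides
votes = np.stack([t.predict(X) for t in trees])
majority = (votes.mean(axis=0) >= 0.5).astype(int)
```

In practice `RandomForestClassifier` does this internally (and additionally samples features at each split), which is why only hyperparameters are tuned below.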
param_grid = {
    'n_estimators': [40, 50, 60],
    'max_features': ['auto'],
    'max_depth': [15, 20, 25],
    'criterion': ['entropy']
}
pipe1 = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', GridSearchCV(RandomForestClassifier(),
                                param_grid=param_grid,
                                cv=5,
                                refit=True))
])
pipe1.fit(X_train, y_train)
Pipeline(steps=[('scaler', StandardScaler()),
('classifier',
GridSearchCV(cv=5, estimator=RandomForestClassifier(),
param_grid={'criterion': ['entropy'],
'max_depth': [15, 20, 25],
'max_features': ['auto'],
'n_estimators': [40, 50, 60]}))])
Library: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
AdaBoost, or Adaptive Boosting, is an ensemble boosting classifier proposed by Yoav Freund and Robert Schapire in 1996. It combines multiple classifiers to increase accuracy. AdaBoost is an iterative ensemble method: it builds a strong classifier by combining multiple poorly performing (weak) classifiers, yielding a high-accuracy strong classifier. The basic concept behind AdaBoost is to adjust the weights of the classifiers and of the training samples in each iteration so that unusual observations are predicted accurately. Any machine learning algorithm can be used as the base classifier if it accepts weights on the training set. AdaBoost should meet two conditions: the classifier should be trained iteratively on variously weighted training examples, and in each iteration it should try to fit these examples well by minimizing the training error.
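A single AdaBoost iteration can be sketched as follows, using a depth-1 decision stump as the weak learner on synthetic data. This is an illustrative simplification of the weight-update idea, not scikit-learn's internal implementation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
y_pm = np.where(y == 1, 1, -1)               # AdaBoost works with labels in {-1, +1}

w = np.full(len(X), 1.0 / len(X))            # start with uniform sample weights
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X, y, sample_weight=w)             # the base learner must accept weights
pred = np.where(stump.predict(X) == 1, 1, -1)

err = w[pred != y_pm].sum()                  # weighted training error
alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # weight of this classifier
w = w * np.exp(-alpha * y_pm * pred)         # misclassified rows gain weight
w = w / w.sum()                              # renormalize for the next iteration
```

Repeating this loop and summing the stumps' alpha-weighted votes gives the final strong classifier; the misclassified "unusual observations" receive progressively more attention.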
# AdaBoost
from sklearn.ensemble import AdaBoostClassifier
param_grid = {
    'n_estimators': [40, 50, 60],
    'learning_rate': [0.01, 0.1, 1]
}
pipe2 = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', GridSearchCV(AdaBoostClassifier(),
                                param_grid=param_grid,
                                cv=5,
                                refit=True))
])
pipe2.fit(X_train, y_train)
Pipeline(steps=[('scaler', StandardScaler()),
('classifier',
GridSearchCV(cv=5, estimator=AdaBoostClassifier(),
param_grid={'learning_rate': [0.01, 0.1, 1],
'n_estimators': [40, 50, 60]}))])
Library: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html
Bagging is an ensemble machine learning approach that combines the outputs of many learners to improve performance. These algorithms draw subsets of the training set, run each subset through a machine learning model, and then combine the models' predictions to generate an overall prediction for each instance in the original data.
Bagging is commonly used in machine learning for classification problems, particularly when using decision trees or artificial neural networks as base learners. It has been applied to various machine-learning algorithms including decision stumps, artificial neural networks (including the multi-layer perceptron), support vector machines, and maximum entropy classifiers. Bagging can also be applied to regression problems, but it has been found to be less effective there than for classification.
The bagging technique is also called bootstrap aggregation. It is a data sampling technique where data is sampled with replacement. Bootstrap aggregation is an ensemble meta-algorithm designed to reduce the variance of an estimate, which improves stability and helps avoid overfitting. A bagging classifier combines the predictions of different estimators and in turn helps reduce variance.
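A small numerical sketch of why aggregating bootstrap estimates reduces variance (synthetic data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=500)          # a fixed pool of observations

def bootstrap_mean():
    # one estimate from a single bootstrap sample (drawn with replacement)
    return rng.choice(data, size=50, replace=True).mean()

# 200 single-sample estimates vs. 200 bagged estimates (each averages 25)
single = np.array([bootstrap_mean() for _ in range(200)])
bagged = np.array([np.mean([bootstrap_mean() for _ in range(25)])
                   for _ in range(200)])

# Aggregating the bootstrap estimates yields a much lower-variance estimator
print(single.var() > bagged.var())   # True
```

The same effect applies when the "estimate" is a decision tree's prediction, which is why a bagging classifier is more stable than any one of its trees.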
# Bagging
from sklearn.ensemble import BaggingClassifier
param_grid = {
    'n_estimators': [40, 50, 60],
}
pipe3 = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', GridSearchCV(BaggingClassifier(),
                                param_grid=param_grid,
                                cv=4,
                                refit=True))
])
pipe3.fit(X_train, y_train)
Pipeline(steps=[('scaler', StandardScaler()),
('classifier',
GridSearchCV(cv=4, estimator=BaggingClassifier(),
param_grid={'n_estimators': [40, 50, 60]}))])
Library: https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier
XGBoost is short for “eXtreme Gradient Boosting.” The “eXtreme” refers to speed enhancements such as parallel computing and cache awareness that makes XGBoost approximately 10 times faster than traditional Gradient Boosting. In addition, XGBoost includes a unique split-finding algorithm to optimize trees, along with built-in regularization that reduces overfitting. Generally speaking, XGBoost is a faster, more accurate version of Gradient Boosting.
Boosting performs better than bagging on average, and Gradient Boosting is arguably the best boosting ensemble. Since XGBoost is an advanced version of Gradient Boosting, and its results are unparalleled, it’s arguably the best machine learning ensemble that we have.
# XGBoost
from xgboost import XGBClassifier
param_grid = {
    'n_estimators': [40, 50, 60],
    'learning_rate': [0.01, 0.1, 1]
}
pipe4 = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', GridSearchCV(XGBClassifier(),
                                param_grid=param_grid,
                                cv=4,
                                refit=True))
])
pipe4.fit(X_train, y_train)
Pipeline(steps=[('scaler', StandardScaler()),
('classifier',
GridSearchCV(cv=4,
estimator=XGBClassifier(base_score=None,
booster=None,
colsample_bylevel=None,
colsample_bynode=None,
colsample_bytree=None,
gamma=None, gpu_id=None,
importance_type='gain',
interaction_constraints=None,
learning_rate=None,
max_delta_step=None,
max_depth=None,
min_child_weight=None,
missing=nan,
monotone_constraints=None,
n_estimators=100,
n_jobs=None,
num_parallel_tree=None,
random_state=None,
reg_alpha=None,
reg_lambda=None,
scale_pos_weight=None,
subsample=None,
tree_method=None,
validate_parameters=None,
verbosity=None),
param_grid={'learning_rate': [0.01, 0.1, 1],
'n_estimators': [40, 50, 60]}))])
Data leakage can cause you to create overly optimistic, if not completely invalid, predictive models. Data leakage occurs when information from outside the training dataset is used to create the model. This additional information can allow the model to learn or know something that it otherwise would not know, and in turn invalidate the estimated performance of the model being constructed.
It is a serious problem for at least 3 reasons:
As machine learning practitioners, we are primarily concerned with this last case.
Do I have Data Leakage?
An easy way to know you have data leakage is if you are achieving performance that seems a little too good to be true. For the pipeline that we have used, we see that there is no data leakage, as we have dealt with all NaN values appropriately and the data types have been set uniformly across columns. We have also ensured that the new features generated during feature engineering were used appropriately during training and are available at the time of inference.
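As a minimal sketch of the pipeline point above (synthetic data; the leaky variant fits the scaler before splitting, while the safe variant fits it inside the pipeline so it only ever sees training rows):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic target, illustrative only

# Leaky: the scaler is fit on ALL rows, so test-set statistics leak into training
X_leaky = StandardScaler().fit_transform(X)

# Safe: split first, then let the pipeline fit the scaler on the training fold only
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
safe = Pipeline([('scaler', StandardScaler()),
                 ('clf', LogisticRegression())]).fit(X_tr, y_tr)
```

Wrapping the scaler inside the `Pipeline`, as all the models above do, is what keeps validation statistics out of the fitted preprocessing.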
In industry, different kinds of metrics are considered when evaluating models; the choice of metric depends on the type of model and its implementation plan. After building a model, multiple metrics can be used to evaluate its accuracy.
For this project, we use the following performance metrics for each of the training models separately:
Log Loss: Log loss evaluates predicted probabilities, penalizing confident but wrong predictions:
LogLoss = -(1/N) * Σ_i Σ_j y_ij * log(p_ij)
where y_ij indicates whether sample i belongs to class j or not, and p_ij indicates the probability of sample i belonging to class j.
Confusion Matrix: Confusion Matrix is a tabular visualization of the ground-truth labels versus model predictions. Each row of the confusion matrix represents the instances in a predicted class and each column represents the instances in an actual class. Confusion Matrix is not exactly a performance metric but sort of a basis on which other metrics evaluate the results.
ROC AUC: The Receiver Operator Characteristic (ROC) curve is an evaluation metric for binary classification problems. It is a probability curve that plots the TPR against FPR at various threshold values and essentially separates the ‘signal’ from the ‘noise’. The Area Under the Curve (AUC) is the measure of the ability of a classifier to distinguish between classes and is used as a summary of the ROC curve. The higher the AUC, the better the performance of the model at distinguishing between the positive and negative classes.
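These metrics can be checked on a tiny hand-worked example (the probabilities below are toy values, verified against scikit-learn):

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score, confusion_matrix

y_true = np.array([0, 0, 1, 1])
p_pos = np.array([0.10, 0.40, 0.35, 0.80])   # predicted P(class 1)

# Log loss by hand: average negative log-probability assigned to the true class
ll = -np.mean(y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos))

# ROC AUC: 3 of the 4 (negative, positive) pairs are ranked correctly -> 0.75
auc = roc_auc_score(y_true, p_pos)

# Confusion matrix at a 0.5 threshold: predictions are [0, 0, 0, 1]
cm = confusion_matrix(y_true, (p_pos >= 0.5).astype(int))
```

Note that log loss and ROC AUC are computed from probabilities, while the confusion matrix depends on the chosen decision threshold.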
Importing all the necessary metrics libraries:
from sklearn.metrics import log_loss
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score
import matplotlib.pyplot as plt
from sklearn import metrics
model_compare = []
print('Training accuracy: ' + str(pipe_naive_bayes.score(X_train,y_train)))
y_pred = pipe_naive_bayes.predict(X_valid)
print('Test accuracy: ' + str(accuracy_score(y_valid,y_pred)))
print('Log loss: ',log_loss(y_valid,y_pred))
print('F1 Score: ',f1_score(y_valid,y_pred,average = 'weighted'))
print('Confusion Matrix: ','\n',confusion_matrix(y_valid, y_pred))
print('ROC_AUC: ',roc_auc_score(y_valid, pipe_naive_bayes.predict_proba(X_valid)[:, 1]))
Training accuracy: 0.8528170731707317
Test accuracy: 0.8502764227642277
Log loss: 5.171345344476074
F1 Score: 0.8635969025926052
Confusion Matrix:
[[50637  5860]
 [ 3348  1655]]
ROC_AUC: 0.7273145927124152
model_compare.append(['Naive Bayes',pipe_naive_bayes.score(X_train,y_train),accuracy_score(y_valid,y_pred),log_loss(y_valid,y_pred),f1_score(y_valid,y_pred,average = 'weighted'),roc_auc_score(y_valid, pipe_naive_bayes.predict_proba(X_valid)[:, 1])])
metrics.plot_roc_curve(pipe_naive_bayes, X_valid, y_valid)
<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x7fa566c98e50>
print('Training accuracy: ' + str(pipe.score(X_train,y_train)))
y_pred = pipe.predict(X_valid)
print('Test accuracy: ' + str(accuracy_score(y_valid,y_pred)))
print('Log loss: ',log_loss(y_valid,y_pred))
print('F1 Score: ',f1_score(y_valid,y_pred,average = 'weighted'))
print('Confusion Matrix: ','\n',confusion_matrix(y_valid, y_pred))
print('ROC_AUC: ',roc_auc_score(y_valid, pipe.predict_proba(X_valid)[:, 1]))
Training accuracy: 0.9193821138211382
Test accuracy: 0.9186016260162602
Log loss: 2.8114005629733114
F1 Score: 0.8803700985178491
Confusion Matrix:
[[56472    25]
 [ 4981    22]]
ROC_AUC: 0.7339861123947258
model_compare.append(['Logistic Regression',pipe.score(X_train,y_train),accuracy_score(y_valid,y_pred),log_loss(y_valid,y_pred),f1_score(y_valid,y_pred,average = 'weighted'),roc_auc_score(y_valid, pipe.predict_proba(X_valid)[:, 1])])
metrics.plot_roc_curve(pipe, X_valid, y_valid)
<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x7fa562ae7cd0>
print('Training accuracy: ' + str(pipe1.score(X_train,y_train)))
y_pred = pipe1.predict(X_valid)
print('Test accuracy: ' + str(accuracy_score(y_valid,y_pred)))
print('Log loss: ',log_loss(y_valid,y_pred))
print('F1 Score: ',f1_score(y_valid,y_pred,average = 'weighted'))
print('Confusion Matrix: ','\n',confusion_matrix(y_valid, y_pred))
print('ROC_AUC: ',roc_auc_score(y_valid, pipe1.predict_proba(X_valid)[:, 1]))
Training accuracy: 0.924150406504065
Test accuracy: 0.9186666666666666
Log loss: 2.8091541904986785
F1 Score: 0.8806527385293619
Confusion Matrix:
[[56468    29]
 [ 4973    30]]
ROC_AUC: 0.7362249163767931
model_compare.append(['Random Forest',pipe1.score(X_train,y_train),accuracy_score(y_valid,y_pred),log_loss(y_valid,y_pred),f1_score(y_valid,y_pred,average = 'weighted'),roc_auc_score(y_valid, pipe1.predict_proba(X_valid)[:, 1])])
metrics.plot_roc_curve(pipe1, X_valid, y_valid)
<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x7fa562ac3100>
print('Training accuracy: ' + str(pipe2.score(X_train,y_train)))
y_pred = pipe2.predict(X_valid)
print('Test accuracy: ' + str(accuracy_score(y_valid,y_pred)))
print('Log loss: ',log_loss(y_valid,y_pred))
print('F1 Score: ',f1_score(y_valid,y_pred,average = 'weighted'))
print('Confusion Matrix: ','\n',confusion_matrix(y_valid, y_pred))
print('ROC_AUC: ',roc_auc_score(y_valid, pipe2.predict_proba(X_valid)[:, 1]))
Training accuracy: 0.9194308943089431
Test accuracy: 0.918650406504065
Log loss: 2.8097154195729788
F1 Score: 0.879700196043292
Confusion Matrix:
[[56497     0]
 [ 5003     0]]
ROC_AUC: 0.6871204002911102
model_compare.append(['AdaBoost',pipe2.score(X_train,y_train),accuracy_score(y_valid,y_pred),log_loss(y_valid,y_pred),f1_score(y_valid,y_pred,average = 'weighted'),roc_auc_score(y_valid, pipe2.predict_proba(X_valid)[:, 1])])
metrics.plot_roc_curve(pipe2, X_valid, y_valid)
<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x7fa5671e99d0>
print('Training accuracy: ' + str(pipe3.score(X_train,y_train)))
y_pred = pipe3.predict(X_valid)
print('Test accuracy: ' + str(accuracy_score(y_valid,y_pred)))
print('Log loss: ',log_loss(y_valid,y_pred))
print('F1 Score: ',f1_score(y_valid,y_pred,average = 'weighted'))
print('Confusion Matrix: ','\n',confusion_matrix(y_valid, y_pred))
print('ROC_AUC: ',roc_auc_score(y_valid, pipe3.predict_proba(X_valid)[:, 1]))
Training accuracy: 0.9996138211382114
Test accuracy: 0.9182926829268293
Log loss: 2.8220726524496103
F1 Score: 0.8832652331886339
Confusion Matrix:
[[56351   146]
 [ 4879   124]]
ROC_AUC: 0.7011701239871685
model_compare.append(['Bagging',pipe3.score(X_train,y_train),accuracy_score(y_valid,y_pred),log_loss(y_valid,y_pred),f1_score(y_valid,y_pred,average = 'weighted'),roc_auc_score(y_valid, pipe3.predict_proba(X_valid)[:, 1])])
metrics.plot_roc_curve(pipe3, X_valid, y_valid)
<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x7fa5671e9190>
print('Training accuracy: ' + str(pipe4.score(X_train,y_train)))
y_pred = pipe4.predict(X_valid)
print('Test accuracy: ' + str(accuracy_score(y_valid,y_pred)))
print('Log loss: ',log_loss(y_valid,y_pred))
print('F1 Score: ',f1_score(y_valid,y_pred,average = 'weighted'))
print('Confusion Matrix: ','\n',confusion_matrix(y_valid, y_pred))
print('ROC_AUC: ',roc_auc_score(y_valid, pipe4.predict_proba(X_valid)[:, 1]))
Training accuracy: 0.9194552845528455
Test accuracy: 0.9186341463414635
Log loss: 2.810277051696389
F1 Score: 0.8797239210009311
Confusion Matrix:
[[56495     2]
 [ 5002     1]]
ROC_AUC: 0.7393593845285834
model_compare.append(['XGBoost',pipe4.score(X_train,y_train),accuracy_score(y_valid,y_pred),log_loss(y_valid,y_pred),f1_score(y_valid,y_pred,average = 'weighted'),roc_auc_score(y_valid, pipe4.predict_proba(X_valid)[:, 1])])
metrics.plot_roc_curve(pipe4, X_valid, y_valid)
<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x7fa56320acd0>
pd.DataFrame(model_compare,columns = ['Model Name', 'Training Accuracy', 'Test Accuracy','LogLoss','F1','ROC-AUC'])
| | Model Name | Training Accuracy | Test Accuracy | LogLoss | F1 | ROC-AUC |
|---|---|---|---|---|---|---|
| 0 | Naive Bayes | 0.852817 | 0.850276 | 5.171345 | 0.863597 | 0.727315 |
| 1 | Logistic Regression | 0.919382 | 0.918602 | 2.811401 | 0.880370 | 0.733986 |
| 2 | Random Forest | 0.924150 | 0.918667 | 2.809154 | 0.880653 | 0.736225 |
| 3 | AdaBoost | 0.919431 | 0.918650 | 2.809715 | 0.879700 | 0.687120 |
| 4 | Bagging | 0.999614 | 0.918293 | 2.822073 | 0.883265 | 0.701170 |
| 5 | XGBoost | 0.919455 | 0.918634 | 2.810277 | 0.879724 | 0.739359 |
Here we see that the Bagging classifier has the highest training accuracy at 99.9%, but an accuracy like this carries the risk of overfitting. The XGBoost model's training accuracy is 91.9%, close to its test accuracy, which suggests a reliable model. The ROC Area Under Curve value for the XGBoost model is 0.739, the highest among the models compared, indicating the best separation between the positive and negative classes. As shown in the combined table above, XGBoost appears to be the best-fitting model based on ROC-AUC value.
Machine learning competitions are a great way to improve your skills and measure your progress as a data scientist. If you are using data from a competition on Kaggle, you can easily submit it from your notebook. Submissions are made as CSV files with two columns: an ID column and a prediction column. The ID field comes from the test data (keeping whatever name it had there, which for this competition is 'SK_ID_CURR'), and the prediction column uses the name of the target field, 'TARGET'.
df_numerical.drop(columns = ['TARGET'],inplace=True)
df_cols = list(df_numerical.columns)+ list(df_categorical.columns)
df_cols
df_test
| | SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48739 | 456221 | Cash loans | F | N | Y | 0 | 121500.0 | 412560.0 | 17473.5 | 270000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 48740 | 456222 | Cash loans | F | N | N | 2 | 157500.0 | 622413.0 | 31909.5 | 495000.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 48741 | 456223 | Cash loans | F | Y | Y | 1 | 202500.0 | 315000.0 | 33205.5 | 315000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 1.0 |
| 48742 | 456224 | Cash loans | M | N | N | 0 | 225000.0 | 450000.0 | 25128.0 | 450000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 48743 | 456250 | Cash loans | F | Y | N | 0 | 135000.0 | 312768.0 | 24709.5 | 270000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
48744 rows × 121 columns
df_test_final = df_test[df_cols]
df_test_final
| | SK_ID_CURR | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | ... | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | WEEKDAY_APPR_PROCESS_START | ORGANIZATION_TYPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | 0.018850 | -19241 | -2329 | -5170.0 | ... | F | N | Y | Unaccompanied | Working | Higher education | Married | House / apartment | TUESDAY | Kindergarten |
| 1 | 100005 | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | 0.035792 | -18064 | -4469 | -9118.0 | ... | M | N | Y | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | FRIDAY | Self-employed |
| 2 | 100013 | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | 0.019101 | -20038 | -4458 | -2175.0 | ... | M | Y | Y | NaN | Working | Higher education | Married | House / apartment | MONDAY | Transport: type 3 |
| 3 | 100028 | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | 0.026392 | -13976 | -1866 | -2000.0 | ... | F | N | Y | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | WEDNESDAY | Business Entity Type 3 |
| 4 | 100038 | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | 0.010032 | -13040 | -2191 | -4000.0 | ... | M | Y | N | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | FRIDAY | Business Entity Type 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48739 | 456221 | 0 | 121500.0 | 412560.0 | 17473.5 | 270000.0 | 0.002042 | -19970 | -5169 | -9094.0 | ... | F | N | Y | Unaccompanied | Working | Secondary / secondary special | Widow | House / apartment | WEDNESDAY | Other |
| 48740 | 456222 | 2 | 157500.0 | 622413.0 | 31909.5 | 495000.0 | 0.035792 | -11186 | -1149 | -3015.0 | ... | F | N | N | Unaccompanied | Commercial associate | Secondary / secondary special | Married | House / apartment | MONDAY | Trade: type 7 |
| 48741 | 456223 | 1 | 202500.0 | 315000.0 | 33205.5 | 315000.0 | 0.026392 | -15922 | -3037 | -2681.0 | ... | F | Y | Y | Unaccompanied | Commercial associate | Secondary / secondary special | Married | House / apartment | WEDNESDAY | Business Entity Type 3 |
| 48742 | 456224 | 0 | 225000.0 | 450000.0 | 25128.0 | 450000.0 | 0.018850 | -13968 | -2731 | -1461.0 | ... | M | N | N | Family | Commercial associate | Higher education | Married | House / apartment | MONDAY | Self-employed |
| 48743 | 456250 | 0 | 135000.0 | 312768.0 | 24709.5 | 270000.0 | 0.006629 | -13962 | -633 | -1072.0 | ... | F | Y | N | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | TUESDAY | Government |
48744 rows × 88 columns
num_pipeline1 = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean'))])
cat_pipeline1 = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))])
df_test_final_num = df_test_final.select_dtypes(exclude='object')
df_test_final_cat = df_test_final.select_dtypes(include='object')
data_pipeline1 = ColumnTransformer([
    ("num_pipeline", num_pipeline1, df_test_final_num.columns),
    ("cat_pipeline", cat_pipeline1, df_test_final_cat.columns)], n_jobs=-1)
df_transformed1 = data_pipeline1.fit_transform(df_test_final)
column_names = list(df_test_final_num.columns) + \
list(data_pipeline1.transformers_[1][1].named_steps["ohe"].get_feature_names(df_test_final_cat.columns))
df_transformed1.shape
(48744, 180)
df_final_test = pd.DataFrame(df_transformed1,columns=column_names)
df_final_test = pd.merge(left=df_final_test, right=previous_application, how='left', left_on='SK_ID_CURR', right_on='SK_ID_CURR')
#df_final_test = pd.merge(left=df_final_test, right=clean_bureau, how='left', left_on='SK_ID_CURR', right_on='SK_ID_CURR')
df_final_test = pd.merge(left=df_final_test, right=clean_bureau, how='left', left_on='SK_ID_CURR', right_on='SK_ID_CURR')
df_final_test
| | SK_ID_CURR | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT_x | AMT_ANNUITY_x | AMT_GOODS_PRICE_x | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | ... | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | SK_DPD_y | SK_DPD_DEF_y | SK_ID_BUREAU | DAYS_CREDIT | DAYS_ENDDATE_FACT | AMT_CREDIT_SUM | DAYS_CREDIT_UPDATE | MONTHS_BALANCE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001.0 | 0.0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | 0.018850 | -19241.0 | -2329.0 | -5170.0 | ... | NaN | NaN | NaN | NaN | 5896633.0 | -857.0 | -715.0 | 168345.00 | -155.0 | -28.0 |
| 1 | 100005.0 | 0.0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | 0.035792 | -18064.0 | -4469.0 | -9118.0 | ... | NaN | NaN | NaN | NaN | 6735201.0 | -137.0 | -123.0 | 58500.00 | -31.0 | -4.0 |
| 2 | 100013.0 | 0.0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | 0.019101 | -20038.0 | -4458.0 | -2175.0 | ... | 0.0 | 22.0 | 0.0 | 0.0 | 5922081.5 | -1835.0 | -1168.0 | 391770.00 | -882.0 | -59.5 |
| 3 | 100028.0 | 2.0 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | 0.026392 | -13976.0 | -1866.0 | -2000.0 | ... | 2.0 | 19.5 | 0.0 | 0.0 | 6356884.5 | -1612.0 | -1375.0 | 129614.04 | -683.5 | -52.5 |
| 4 | 100038.0 | 1.0 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | 0.010032 | -13040.0 | -2191.0 | -4000.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48739 | 456221.0 | 0.0 | 121500.0 | 412560.0 | 17473.5 | 270000.0 | 0.002042 | -19970.0 | -5169.0 | -9094.0 | ... | NaN | NaN | NaN | NaN | 6645689.0 | -601.0 | -603.0 | 145867.50 | -99.0 | -19.0 |
| 48740 | 456222.0 | 2.0 | 157500.0 | 622413.0 | 31909.5 | 495000.0 | 0.035792 | -11186.0 | -1149.0 | -3015.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 48741 | 456223.0 | 1.0 | 202500.0 | 315000.0 | 33205.5 | 315000.0 | 0.026392 | -15922.0 | -3037.0 | -2681.0 | ... | NaN | NaN | NaN | NaN | 6433063.0 | -349.0 | -406.0 | 54000.00 | -159.0 | -11.0 |
| 48742 | 456224.0 | 0.0 | 225000.0 | 450000.0 | 25128.0 | 450000.0 | 0.018850 | -13968.0 | -2731.0 | -1461.0 | ... | NaN | NaN | NaN | NaN | 6436438.0 | -1421.0 | -1513.0 | 147339.00 | -1058.0 | -46.0 |
| 48743 | 456250.0 | 0.0 | 135000.0 | 312768.0 | 24709.5 | 270000.0 | 0.006629 | -13962.0 | -633.0 | -1072.0 | ... | 0.0 | 4.5 | 0.0 | 0.0 | 6817237.0 | -824.0 | -760.0 | 483349.50 | -31.0 | -27.0 |
48744 rows × 240 columns
df_final_test['FEATURE1']= df_final_test['AMT_TOTAL_RECEIVABLE']/(df_final_test['AMT_BALANCE']+1)
df_final_test['FEATURE2'] = df_final_test['AMT_TOTAL_RECEIVABLE']/(df_final_test['AMT_RECIVABLE']+1)
df_final_test['FEATURE3'] = df_final_test['AMT_TOTAL_RECEIVABLE']/(df_final_test['AMT_RECEIVABLE_PRINCIPAL']+1)
df_final_test['FEATURE4']=df_final_test['AMT_CREDIT_x'] / (df_final_test['AMT_INCOME_TOTAL']+1)
df_final_test['FEATURE5']=df_final_test['AMT_ANNUITY_x'] / (df_final_test['AMT_INCOME_TOTAL']+1)
df_final_test['FEATURE6']= df_final_test['AMT_ANNUITY_x'] / (df_final_test['AMT_CREDIT_x'] +1)
df_final_test['FEATURE7']=(df_final_test['EXT_SOURCE_1']*df_final_test['EXT_SOURCE_2']*df_final_test['EXT_SOURCE_3'])
#df_final_test['FEATURE8']=df_final_test['NAME_TYPE_SUITE_Spouse, partner'] / (df_final_test['REGION_RATING_CLIENT_W_CITY']+1)
#df_final_test['FEATURE9']=df_final_test['REGION_RATING_CLIENT'] / (df_final_test['REGION_RATING_CLIENT_W_CITY']+1)
df_final_test.rename(columns={'AMT_CREDIT_y':'AMT_CREDIT'},inplace=True)
df_final_test = df_final_test.apply(lambda x: x.fillna(x.median()),axis=0)
print(clean_bureau.columns)
Index(['SK_ID_CURR', 'SK_ID_BUREAU', 'DAYS_CREDIT', 'DAYS_ENDDATE_FACT',
'AMT_CREDIT_SUM', 'DAYS_CREDIT_UPDATE', 'MONTHS_BALANCE'],
dtype='object')
corr_greater_than_5
| | col_name | Correlation |
|---|---|---|
| 0 | TARGET | 1.000000 |
| 1 | DAYS_CREDIT | 0.079099 |
| 2 | DAYS_BIRTH | 0.078236 |
| 3 | DAYS_CREDIT_UPDATE | 0.063527 |
| 4 | REGION_RATING_CLIENT_W_CITY | 0.060875 |
| 5 | REGION_RATING_CLIENT | 0.058882 |
| 6 | NAME_INCOME_TYPE_Working | 0.057504 |
| 7 | DAYS_LAST_PHONE_CHANGE | 0.055228 |
| 8 | CODE_GENDER_M | 0.054729 |
| 9 | DAYS_ID_PUBLISH | 0.051455 |
| 10 | FEATURE2 | 0.051093 |
| 11 | REG_CITY_NOT_WORK_CITY | 0.050981 |
| 242 | CODE_GENDER_F | -0.054729 |
| 243 | NAME_EDUCATION_TYPE_Higher education | -0.056578 |
| 244 | EXT_SOURCE_1 | -0.099162 |
| 245 | EXT_SOURCE_3 | -0.157409 |
| 246 | EXT_SOURCE_2 | -0.160283 |
| 247 | FEATURE7 | -0.189587 |
selected_features = corr_greater_than_5['col_name']
#selected_features.to_csv('final_application_test.csv')
corr_greater_than_5.drop(0)
| | col_name | Correlation |
|---|---|---|
| 1 | DAYS_CREDIT | 0.079099 |
| 2 | DAYS_BIRTH | 0.078236 |
| 3 | DAYS_CREDIT_UPDATE | 0.063527 |
| 4 | REGION_RATING_CLIENT_W_CITY | 0.060875 |
| 5 | REGION_RATING_CLIENT | 0.058882 |
| 6 | NAME_INCOME_TYPE_Working | 0.057504 |
| 7 | DAYS_LAST_PHONE_CHANGE | 0.055228 |
| 8 | CODE_GENDER_M | 0.054729 |
| 9 | DAYS_ID_PUBLISH | 0.051455 |
| 10 | FEATURE2 | 0.051093 |
| 11 | REG_CITY_NOT_WORK_CITY | 0.050981 |
| 242 | CODE_GENDER_F | -0.054729 |
| 243 | NAME_EDUCATION_TYPE_Higher education | -0.056578 |
| 244 | EXT_SOURCE_1 | -0.099162 |
| 245 | EXT_SOURCE_3 | -0.157409 |
| 246 | EXT_SOURCE_2 | -0.160283 |
| 247 | FEATURE7 | -0.189587 |
selected_features = corr_greater_than_5['col_name'].drop(0)
new_df_final_test = df_final_test[selected_features]
new_df_final_test
#new_df_final_test.to_csv('final_application_test.csv')
| | DAYS_CREDIT | DAYS_BIRTH | DAYS_CREDIT_UPDATE | REGION_RATING_CLIENT_W_CITY | REGION_RATING_CLIENT | NAME_INCOME_TYPE_Working | DAYS_LAST_PHONE_CHANGE | CODE_GENDER_M | DAYS_ID_PUBLISH | FEATURE2 | REG_CITY_NOT_WORK_CITY | CODE_GENDER_F | NAME_EDUCATION_TYPE_Higher education | EXT_SOURCE_1 | EXT_SOURCE_3 | EXT_SOURCE_2 | FEATURE7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -857.0 | -19241.0 | -155.00 | 2.0 | 2.0 | 1.0 | -1740.0 | 0.0 | -812.0 | 0.000000 | 0.0 | 1.0 | 1.0 | 0.752614 | 0.159520 | 0.789654 | 0.094803 |
| 1 | -137.0 | -18064.0 | -31.00 | 2.0 | 2.0 | 1.0 | 0.0 | 1.0 | -1623.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.564990 | 0.432962 | 0.291656 | 0.071345 |
| 2 | -1835.0 | -20038.0 | -882.00 | 2.0 | 2.0 | 1.0 | -856.0 | 1.0 | -3503.0 | 0.000000 | 0.0 | 0.0 | 1.0 | 0.501180 | 0.610991 | 0.699787 | 0.214286 |
| 3 | -1612.0 | -13976.0 | -683.50 | 2.0 | 2.0 | 1.0 | -1805.0 | 0.0 | -4208.0 | 0.999863 | 0.0 | 1.0 | 0.0 | 0.525734 | 0.612704 | 0.509677 | 0.164177 |
| 4 | -981.0 | -13040.0 | -320.25 | 2.0 | 2.0 | 1.0 | -821.0 | 1.0 | -4262.0 | 0.000000 | 1.0 | 0.0 | 0.0 | 0.202145 | 0.500106 | 0.425687 | 0.043034 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48739 | -601.0 | -19970.0 | -99.00 | 3.0 | 3.0 | 1.0 | -684.0 | 0.0 | -3399.0 | 0.000000 | 0.0 | 1.0 | 0.0 | 0.501180 | 0.643026 | 0.648575 | 0.209017 |
| 48740 | -981.0 | -11186.0 | -320.25 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | -3003.0 | 0.000000 | 1.0 | 1.0 | 0.0 | 0.501180 | 0.500106 | 0.684596 | 0.171589 |
| 48741 | -349.0 | -15922.0 | -159.00 | 2.0 | 2.0 | 0.0 | -838.0 | 0.0 | -1504.0 | 0.000000 | 0.0 | 1.0 | 0.0 | 0.733503 | 0.283712 | 0.632770 | 0.131682 |
| 48742 | -1421.0 | -13968.0 | -1058.00 | 2.0 | 2.0 | 0.0 | -2308.0 | 1.0 | -1364.0 | 0.000000 | 1.0 | 0.0 | 1.0 | 0.373090 | 0.595456 | 0.445701 | 0.099016 |
| 48743 | -824.0 | -13962.0 | -31.00 | 2.0 | 2.0 | 1.0 | -327.0 | 0.0 | -4220.0 | 0.999994 | 0.0 | 1.0 | 0.0 | 0.501180 | 0.272134 | 0.456541 | 0.062267 |
48744 rows × 17 columns
X = new_df_final_test.values
result = pipe4.predict(X)
result_prob = pipe4.predict_proba(X)
r = pd.DataFrame(result_prob, columns=['class_0_prob', 'class_1_prob'])
final_sub = pd.DataFrame()
final_sub['SK_ID_CURR'] = df_test['SK_ID_CURR']
final_sub['TARGET'] = r['class_1_prob']
final_sub = final_sub.set_index('SK_ID_CURR')
final_sub
| | TARGET |
|---|---|
| SK_ID_CURR | |
| 100001 | 0.059373 |
| 100005 | 0.117161 |
| 100013 | 0.031515 |
| 100028 | 0.050429 |
| 100038 | 0.133859 |
| ... | ... |
| 456221 | 0.040164 |
| 456222 | 0.072552 |
| 456223 | 0.073067 |
| 456224 | 0.056120 |
| 456250 | 0.172440 |
48744 rows × 1 columns
final_sub.to_csv('submission.csv')
Neural networks, also known as artificial neural networks (ANNs) or simulated neural networks (SNNs), are a subset of machine learning and are at the heart of deep learning algorithms. Their name and structure are inspired by the human brain, mimicking the way that biological neurons signal to one another.
Artificial neural networks (ANNs) are composed of node layers: an input layer, one or more hidden layers, and an output layer. Each node, or artificial neuron, connects to others and has an associated weight and threshold. If the output of an individual node is above the specified threshold value, that node is activated, sending data to the next layer of the network. Otherwise, no data is passed along to the next layer.
Neural networks rely on training data to learn and improve their accuracy over time. However, once these learning algorithms are fine-tuned for accuracy, they are powerful tools in computer science and artificial intelligence, allowing us to classify and cluster data at a high velocity. Tasks in speech recognition or image recognition can take minutes versus hours when compared to the manual identification by human experts. One of the most well-known neural networks is Google’s search algorithm.
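The weighted-sum-and-threshold behaviour described above can be sketched for a single artificial neuron; the inputs, weights, and bias below are arbitrary illustration values, not anything learned from the HCDR data:

```python
import numpy as np

def neuron(inputs, weights, bias, threshold=0.0):
    """A single artificial neuron: weighted sum of inputs plus bias,
    passed along only if it clears the activation threshold."""
    z = np.dot(inputs, weights) + bias
    return z if z > threshold else 0.0  # ReLU-like gating

# Two inputs with hand-picked weights: 0.5*0.6 + 0.8*(-0.2) + 0.1 ≈ 0.24
out = neuron(np.array([0.5, 0.8]), np.array([0.6, -0.2]), bias=0.1)
print(out)  # ≈ 0.24, above the threshold, so the neuron "fires"
```

Stacking many such neurons into layers, and replacing the hand-picked weights with learned ones, gives the MLP implemented below.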
import torch
import torch.nn as nn
import torch.nn.functional as Function
import torch.optim as optim
import torchvision
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter("runs/")
from torch.utils.data import DataLoader
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
def corr_target(df, cor):
    # Keep only the columns whose absolute correlation with TARGET exceeds the cutoff
    correlation = df.corr()['TARGET'].sort_values(ascending=False).reset_index()
    correlation.columns = ['col_name', 'Correlation']
    after_correlation = correlation[abs(correlation['Correlation']) > cor]
    return after_correlation
# Taking features that have correlation of at least 10% with target variable.
train_df = pd.read_csv('final_application_train.csv')
train_df = train_df.iloc[:,1:]
train_df_cols = corr_target(train_df,0.1)
train_df = train_df[train_df_cols['col_name']]
train_df
| TARGET | EXT_SOURCE_3 | EXT_SOURCE_2 | FEATURE7 | |
|---|---|---|---|---|
| 0 | 1.0 | 0.139376 | 0.262949 | 0.003043 |
| 1 | 0.0 | 0.510856 | 0.622246 | 0.098945 |
| 2 | 0.0 | 0.729567 | 0.555912 | 0.203649 |
| 3 | 0.0 | 0.510856 | 0.650442 | 0.166847 |
| 4 | 0.0 | 0.510856 | 0.322738 | 0.082786 |
| ... | ... | ... | ... | ... |
| 307495 | 0.0 | 0.510856 | 0.681632 | 0.050690 |
| 307496 | 0.0 | 0.510856 | 0.115992 | 0.029753 |
| 307497 | 0.0 | 0.218859 | 0.535722 | 0.087235 |
| 307498 | 1.0 | 0.661024 | 0.514163 | 0.170659 |
| 307499 | 0.0 | 0.113922 | 0.708569 | 0.059287 |
307500 rows × 4 columns
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
X = train_df.drop('TARGET',axis=1).values
y = train_df['TARGET'].values
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=123)
X_train = torch.FloatTensor(X_train).to(device)
X_test = torch.FloatTensor(X_test).to(device)
y_train = torch.tensor(y_train, dtype=torch.long, device=device)
y_test = torch.tensor(y_test, dtype=torch.long, device=device)
# Implementing a Neural Network with 4 Hidden Layers
class Neural_Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.input_layer_size = 3
        self.hidden_layer1_size = 128
        self.hidden_layer2_size = 64
        self.hidden_layer3_size = 32
        self.hidden_layer4_size = 10
        self.output_layer_size = 2
        # No softmax here: nn.CrossEntropyLoss applies log-softmax internally
        self.W1 = nn.Linear(self.input_layer_size, self.hidden_layer1_size)
        self.W2 = nn.Linear(self.hidden_layer1_size, self.hidden_layer2_size)
        self.W3 = nn.Linear(self.hidden_layer2_size, self.hidden_layer3_size)
        self.W4 = nn.Linear(self.hidden_layer3_size, self.hidden_layer4_size)
        self.out = nn.Linear(self.hidden_layer4_size, self.output_layer_size)

    def forward(self, x):
        x = Function.relu(self.W1(x))
        x = Function.relu(self.W2(x))
        x = Function.relu(self.W3(x))
        x = Function.relu(self.W4(x))
        return self.out(x)
# Defining Loss Function and Optimizer
torch.manual_seed(20)
model= Neural_Network().to(device)
loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(),lr=0.001)
# Predictions on the held-out split. Note that this cell runs before the
# training loop below, so the scores that follow reflect an untrained model.
pred = []
with torch.no_grad():
    for data in X_test:
        y_pred = model(data)
        pred.append(y_pred.argmax().item())
from sklearn.metrics import confusion_matrix
print('Confusion Matrix:')
confusion_matrix(y_test,pred)
Confusion Matrix:
array([[56554, 0],
[ 4946, 0]])
from sklearn.metrics import f1_score
print('F1 Score: ',f1_score(y_test,pred,average = 'weighted'))
F1 Score: 0.8810505529989652
from sklearn.metrics import log_loss
# Note: passing hard 0/1 predictions inflates log loss;
# predicted probabilities would give a fairer estimate.
print('Log Loss: ', log_loss(y_test, pred))
Log Loss: 2.777703870719159
from sklearn.metrics import roc_auc_score
auc = roc_auc_score(y_test,pred)
auc
0.5
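An ROC AUC of exactly 0.5 is the signature of a constant classifier: since the model predicts class 0 for every sample (as the confusion matrix above shows), its predictions carry no ranking information. A minimal sanity check with scikit-learn, using toy labels rather than the HCDR data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy imbalanced labels, echoing the majority-class skew in our data
y_true = np.array([0, 0, 0, 0, 1, 1])
constant_preds = np.zeros_like(y_true)  # a model that always predicts class 0

# All scores tied: there is nothing to rank, so AUC collapses to 0.5
print(roc_auc_score(y_true, constant_preds))  # 0.5
```

This is why AUC, unlike raw accuracy, immediately exposes a degenerate majority-class predictor on imbalanced data.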
# Training loop: tracking loss and training accuracy at each iteration
iterations = 500
total_loss = []
total_acc = []
shape = y_train.shape[0]
for i in range(1, iterations + 1):
    y_pred = model.forward(X_train)
    loss = loss_function(y_pred, y_train)
    total_loss.append(loss.item())
    _, predicted = torch.max(y_pred, 1)
    acc = (predicted == y_train).sum().item()
    total_acc.append(acc / shape)
    if i % 50 == 1:
        print("Loss for Iteration", i, 'is', loss.item())
        print('Train_Accuracy for Iteration', i, 'is', (acc / shape) * 100, '%')
    writer.add_scalar('Training loss', loss.item(), i)
    writer.add_scalar('Accuracy', acc / shape, i)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # apply the gradient update; without it the weights never change and the loss stays flat
Loss for Iteration 1 is 0.6235156655311584
Train_Accuracy for Iteration 1 is 91.91991869918698 %
Loss for Iteration 51 is 0.6235156655311584
Train_Accuracy for Iteration 51 is 91.91991869918698 %
Loss for Iteration 101 is 0.6235156655311584
Train_Accuracy for Iteration 101 is 91.91991869918698 %
Loss for Iteration 151 is 0.6235156655311584
Train_Accuracy for Iteration 151 is 91.91991869918698 %
Loss for Iteration 201 is 0.6235156655311584
Train_Accuracy for Iteration 201 is 91.91991869918698 %
Loss for Iteration 251 is 0.6235156655311584
Train_Accuracy for Iteration 251 is 91.91991869918698 %
Loss for Iteration 301 is 0.6235156655311584
Train_Accuracy for Iteration 301 is 91.91991869918698 %
Loss for Iteration 351 is 0.6235156655311584
Train_Accuracy for Iteration 351 is 91.91991869918698 %
Loss for Iteration 401 is 0.6235156655311584
Train_Accuracy for Iteration 401 is 91.91991869918698 %
Loss for Iteration 451 is 0.6235156655311584
Train_Accuracy for Iteration 451 is 91.91991869918698 %
%load_ext tensorboard
The tensorboard extension is already loaded. To reload it, use: %reload_ext tensorboard
!pip install tensorboard
Requirement already satisfied: tensorboard in /opt/anaconda3/lib/python3.8/site-packages (2.8.0)
%tensorboard --logdir=runs
plt.figure(figsize=(7, 7))
plt.plot(range(iterations), total_loss)
plt.ylabel('Loss')
plt.xlabel('Number of Iterations')
plt.figure(figsize=(7, 7))
plt.plot(range(iterations), total_acc)
plt.ylabel('Training Accuracy')
plt.xlabel('Number of Iterations')
# Testing Accuracy
from sklearn.metrics import accuracy_score
test_acc = accuracy_score(y_test,pred)
print(test_acc * 100,'%')
91.95772357723577 %
test_df = pd.read_csv('final_application_test.csv')
test_df = test_df.iloc[:,1:]
test_cols = list(train_df.columns)
test_cols.remove('TARGET')
testing = torch.FloatTensor(test_df[test_cols].values)
prediction = []
probability = []
with torch.no_grad():
    for data in testing:
        y_pred = model(data)
        prediction.append(y_pred.argmax().item())
        # dim=0 because y_pred is a 1-D tensor of class logits
        probability.append(Function.softmax(y_pred, dim=0)[1].item())
sub = pd.DataFrame()
sub["SK_ID_CURR"]= df_test['SK_ID_CURR']
sub['TARGET'] = probability
sub = sub.set_index('SK_ID_CURR')
sub.to_csv('submission_nn4.csv')
The following are the results achieved through trial and error with different numbers of hidden layers:
We can see from the table that as the number of hidden layers increases, the accuracy also increases.
The results for our project submissions on Kaggle are as follows:
Note: We did not submit a file for Naive Bayes because its accuracy was too low to be competitive.
Summary of Kaggle submissions:
Best model: XGBoost
We are attempting to predict whether the credit-less population will be able to repay their loans, and we sourced our data from the Home Credit dataset to realize this goal. A fair chance to obtain a loan is extremely important to this population, and as students we feel a strong connection to it, which is why we chose this project. During the first phase, we began experimenting with the dataset: after one-hot encoding (OHE) the categorical features, we applied imputation to handle missing values before feeding the data into the model. In phase 2, we added feature engineering and hyperparameter tuning to refine the results.
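The phase-1 preprocessing described above (OHE plus imputation) can be sketched as a scikit-learn pipeline; the column names and toy values below are illustrative placeholders, not the actual HCDR schema:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

# Tiny toy frame; the real data uses the HCDR application columns
df = pd.DataFrame({'AMT_INCOME': [100000, np.nan, 250000, 90000],
                   'CONTRACT_TYPE': ['Cash', 'Revolving', np.nan, 'Cash'],
                   'TARGET': [0, 1, 0, 1]})
numeric = ['AMT_INCOME']
categorical = ['CONTRACT_TYPE']

# Impute, then encode/scale, per column type
preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), numeric),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('ohe', OneHotEncoder(handle_unknown='ignore'))]), categorical),
])

pipe = Pipeline([('prep', preprocess), ('clf', LogisticRegression())])
pipe.fit(df[numeric + categorical], df['TARGET'])
print(pipe.predict_proba(df[numeric + categorical]).shape)  # (4, 2)
```

Wrapping preprocessing and classifier in one pipeline ensures the same imputation and encoding are applied consistently at train and prediction time.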
Finally, we evaluated the results using accuracy score, log loss, confusion matrix and ROC AUC scores. In this last phase we added several more models, namely AdaBoost, XGBoost and bagging, alongside the previously used logistic regression, random forest and naive bayes. Additionally, we implemented a Multilayer Perceptron (MLP) model using PyTorch for loan default classification. The MLP reached a training accuracy of 91.92% and a test accuracy of 91.96%, which is close to our previous non-deep-learning models. Deep learning models require large amounts of data to train, so in the longer run they are likely to work best for HCDR classification compared to the usual supervised models.
The best fitting model is XGBoost with the following scores:
The future scope for this project includes using embeddings in deep learning models, or adopting advanced classifiers such as LightGBM and other boosting models that may produce better results. Features can also be refined further to increase model accuracy.
Pipeline: https://scikit-learn.org/stable/
Data: https://www.kaggle.com/c/home-credit-default-risk/data
Metrics: https://towardsdatascience.com/metrics-to-evaluate-your-machine-learning-algorithm-f10ba6e38234
Professor's baseline notebook: https://github.iu.edu/jshanah/I526_AML_Student/blob/master/Assignments/Unit-Project-Home-Credit-Default-Risk/HCDR_Phase_1_baseline_submission/HCDR_baseLine_submission_with_numerical_and_cat_features_to_kaggle.ipynb